Replacement mpn/generic/mul.c
bodrato at mail.dm.unipi.it
bodrato at mail.dm.unipi.it
Fri Dec 18 16:45:31 CET 2009
Ciao,
> I have worked on a better mpn_mul, and here is a preliminary result.
Looks an improvement to my eyes.
> * The cutoff lines in the toomX2 code are now exactly between the ideal
> lines
> of the surrounding algorithms. (Is this close-enough to optimal?)
I do not know where the optimal line lies, but this approximation is
reasonable.
> * The toomX3 code now uses a structure similar to the one of toomX2,
> except
> that it loops longer in the unbalanced case. The result is that the
> remaining area might have un < vn.
>
> * The toomX3 code is used for the largest non-FFT unbalanced operands. It
> therefore calls mpn_mul recursively for certain cases.
I can understand why it calls mpn_mul when the final un < vn, but the else
branch should look exactly like the else branch of the previous "if (2 *
un >= 5 * vn)", with ws replacing prodp; saving a new allocation of the
scratch.
About the scratch, in both the toomX2 and the toomX3 variants you
#define ITCH ((un + vn) * 4 + 100)
but assume un=70*vn... you are allocating a lot of space, but you will
call all the lower-level toom with at most un=3*vn (or 5/2 vn for toomX3)
Maybe
#define ITCH ((3*vn + vn) * 4 + 100)
and
#define ITCH ((5*vn + 2*vn) * 2 + 100)
are enough?
> * The toomX4 code now consists of just a call to toom44_mul. The looping
> is gone.
I have some doubt about using FFT when the operand are very unbalanced.
Let us assume FFT threshold is 2000. If Toom-4 threshold is less than 500,
then we will compute a 1500x500 product with FFT. This may be a good idea.
But what about greater unbalancements? 3000x500, 5000x500.... let us
examine the general (500k)x500 case.
Complexity of the school-book algorithm (or any toomXn iteration) grows
linearly with the growing k. FFT grows at least with O(k log(k)).
Iteration will eventually will be faster than FFT...
Even for (un+vn) above the FFT line, it is maybe wise to iterate if the
product is very unbalanced e.g. for un>10vn. The iterated algorithm can be
FFT itself...
Best regards,
Marco
--
http://bodrato.it/papers/
More information about the gmp-devel
mailing list