Replacement mpn/generic/mul.c

Fri Dec 18 16:45:31 CET 2009

Ciao,

> I have worked on a better mpn_mul, and here is a preliminary result.

Looks an improvement to my eyes.

> * The cutoff lines in the toomX2 code are now exactly between the ideal
> lines
>   of the surrounding algorithms.  (Is this close-enough to optimal?)

I do not know where the optimal line lies, but this approximation is
reasonable.

> * The toomX3 code now uses a structure similar to the one of toomX2,
> except
>   that it loops longer in the unbalanced case.  The result is that the
>   remaining area might have un < vn.
>
> * The toomX3 code is used for the largest non-FFT unbalanced operands.  It
>   therefore calls mpn_mul recursively for certain cases.

I can understand why it calls mpn_mul when the final un < vn, but the else
branch should look exactly like the else branch of the previous "if (2 *
un >= 5 * vn)", with ws replacing prodp; saving a new allocation of the
scratch.

About the scratch, in both the toomX2 and the toomX3 variants you
#define ITCH ((un + vn) * 4 + 100)
but assume un=70*vn... you are allocating a lot of space, but you will
call all the lower-level toom with at most un=3*vn (or 5/2 vn for toomX3)

Maybe
#define ITCH ((3*vn + vn) * 4 + 100)
and
#define ITCH ((5*vn + 2*vn) * 2 + 100)
are enough?

> * The toomX4 code now consists of just a call to toom44_mul.  The looping
> is gone.

I have some doubt about using FFT when the operand are very unbalanced.

Let us assume FFT threshold is 2000. If Toom-4 threshold is less than 500,
then we will compute a 1500x500 product with FFT. This may be a good idea.

But what about greater unbalancements? 3000x500, 5000x500.... let us
examine the general (500k)x500 case.

Complexity of the school-book algorithm (or any toomXn iteration) grows
linearly with the growing k. FFT grows at least with O(k log(k)).
Iteration will eventually will be faster than FFT...
Even for (un+vn) above the FFT line, it is maybe wise to iterate if the
product is very unbalanced e.g. for un>10vn. The iterated algorithm can be
FFT itself...

Best regards,
Marco

-- 
http://bodrato.it/papers/