# Replacement mpn/generic/mul.c

Torbjorn Granlund tg at gmplib.org
Fri Dec 18 20:32:59 CET 2009

```bodrato at mail.dm.unipi.it writes:

I can understand why it calls mpn_mul when the final un < vn, but the else
branch should look exactly like the else branch of the previous "if (2 *
un >= 5 * vn)", with ws replacing prodp; saving a new allocation of the
scratch.

This is a flaw.

The idea was that the toomX3 block should be used for larger operands.
The  BELOW_THRESHOLD (vn, MUL_TOOM44_THRESHOLD) should go away.

Then the mpn_mul call makes more sense--we can have operands in the
toom44 range.

Do you agree?

About the scratch, in both the toomX2 and the toomX3 variants you
#define ITCH ((un + vn) * 4 + 100)
but assume un=70*vn... you are allocating a lot of space, but you will
call all the lower-level toom with at most un=3*vn (or 5/2 vn for toomX3)

Maybe
#define ITCH ((3*vn + vn) * 4 + 100)
and
#define ITCH ((5*vn + 2*vn) * 2 + 100)
are enough?

Yes, the ITCH definition is not good (it was inherited from the old code).

might need as much 15vn?  You're suggesting 14vn?

> * The toomX4 code now consists of just a call to toom44_mul.  The looping
> is gone.

I have some doubt about using FFT when the operand are very unbalanced.

Let us assume FFT threshold is 2000. If Toom-4 threshold is less than 500,
then we will compute a 1500x500 product with FFT. This may be a good idea.

But what about greater unbalancements? 3000x500, 5000x500.... let us
examine the general (500k)x500 case.

Complexity of the school-book algorithm (or any toomXn iteration) grows
linearly with the growing k. FFT grows at least with O(k log(k)).
Iteration will eventually will be faster than FFT...
Even for (un+vn) above the FFT line, it is maybe wise to iterate if the
product is very unbalanced e.g. for un>10vn. The iterated algorithm can be
FFT itself...

I completely agree.

Question is: What condition should we add as a disjunction for entering
the toomX3 block?

Your analysis implies that is should be a function if both un and vn, of
course...

Thanks much for your quick feedback!  I have attached an updated
version.  The main change is that the "X3" and "X4" blocks are now
entered with a common condition, a condition that just excludes FFT
operands.  They are then separated with a cimple MUL_TOOM44_THRESHOLD +
TOOM44_OK condition.

(I have used coverage analysis to make sure all code is executed.)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul.c
Type: application/octet-stream
Size: 9579 bytes
Desc: not available