For project "Faster multiplication": faster Toom-3, proposed Toom-4

Thu Oct 26 11:49:16 CEST 2006

Dear Developers,

We have just published a preprint about optimal Toom-Cook interpolation.

The most interesting result for you should be a faster Toom 3-way
multiplication. The interpolation used by GMP requires 8 sums, 1
division and 4 shift, the new proposed one requires 1 shift
less. Detailed description follows.

We could also say that the new proposed method, for 3-way
interpolation, is optimal, because we found it by a (pruned)
exhaustive search. Anyway we should doubly check our software and our
model before claiming a proof.

We hope this information will be useful to you, and we would be honoured
if you will take the time to glance to our preprint (you can find it in
http://bodrato.it/papers/#CIVV2006 ).

We will be glad to answer any clarification request of yours.

Best regards,
Marco Bodrato and Alberto Zanoni

------------------------------------------------------------
>From comments in gmp-4.2.1/mpn/generic/mul_n.c, function mpn_toom3_mul_n:

/* Implemented algorithm is the following:
[...]
 * 2. v0   <- a0*b0
 *    v1   <- (a0+a1+a2)*(b0+b1+b2)
 *    v2   <- (a0+2*a1+4*a2)*(b0+2*b1+4*b2)
 *    vm1  <- (a0-a1+a2)*(b0-b1+b2)
 *    vinf <- a2*b2
 * -- Inversion starts here! --
 *    t1   <- ((2*vm1+v2)/3+v0)/2-2*vinf
 *    t2   <- (v1+vm1)/2
 * 3. result is c0+c1*t+c2*t^2+c3*t^3+c4*t^4 where
 *    c0   <- v0
 *    c1   <- v1 - t1
 *    c2   <- t2 - v0 - vinf
 *    c3   <- t1 - t2
 *    c4   <- vinf

This requires:
	8(sum/diff) + 1(divby3) + 2(divby2) + 2 (mulby2)

Our proposed algorithm starts from the same evaluation
v0, v1, vm1, v2, vinf, then inverts by using the following steps:

 * 2i. Inversion: every result is positive, all divisions are exact
 *    t2   <- (v2 - vm1)/3            %%  v2 -= vm1; v2 /= 3;
 *    tm1  <- (v1 - vm1)/2            %% vm1 *= -1 ; vm1 += v1; vm1>>= 1;
 *    t1   <- (v1 - v0)               %%  v1 -= v0 ;
 *    t2   <- (t2 - t1)/2             %%  v2 -= v1 ; v2 >>= 1;
 *    t1   <- (t1 - tm1 - vinf)       %%  v1 -= vm1; v1 -= vinf;
 *    t2   <- (t2 - 2*vinf)           %%  v2 -= 2*vinf;
 *    tm1  <- (tm1 - t2)              %% vm1 -= v2;
 * 3. result is c0+c1*t+c2*t^2+c3*t^3+c4*t^4 where
 *    c0   <- v0
 *    c1   <- tm1
 *    c2   <- t1
 *    c3   <- t2
 *    c4   <- vinf

New algorithm requires:
	8(sum/diff) + 1(divby3) + 2(divby2) + 1 (mulby2)

Difference is 1 shift. Moreover the algorithm can be easily
implemented without any need for temporary variables, defining t2=v2,
t1=v1, tm1=vm1.