Toom-Cook 4x3 and 5x2 multiplication: Toom3.5

Thu Nov 23 09:03:55 CET 2006

Dear Paul,

On Wed, 22 Nov 2006 22:34:00 +0100, Paul Zimmermann wrote
> I have implemented at the mpn level the Toom-Cook 2.5 algorithm suggested by
> Marco in http://gmplib.org/list-archives/gmp-devel/2006-November/000675.html

Great! On my side, I implemented Toom-Cook 3.5 at the mpz level. You can find
the code and some graphs together with Toom5, Toom6, and Toom7 on my page
http://bodrato.it/software/toom.html#TC5 .

> The code is included below. On a Pentium M with gmp-4.2.1, it it faster than
> mpn_mul up from n=11 (i.e. a 33 x 22 product), and up to the FFT threshold

On mine it is faster from n=9 (27x18)!
$ ./tc2.5 9 1000000
mpn_mul      took 3264ms
mpn_toom_3_2 took 3104ms (0.95)

> (i.e. 4992 x 3328). On this large range, it saves up to about 20% with respect
> to mpn_mul.

Toom3.5 is faster than mpz_mul from 120x90 product, and up to 4500x3375.
Saving up to 35%. Then, Toom7 is still 5% faster than FFT, up to 6500x4875.
This Toom3.5 implementation is that fast also because it recurses on Toom7
for balanced sub-multiplications.

Here are some timings:

$ ./tc34567 120 20000
Testing sizes a 120 , b 90
mpz_mul took 928ms
mpz_tc3 took 920ms
toom3.5 took 884ms (*)
mpz_tc5 took 1112ms
mpz_tc6 took 1284ms
mpz_tc7 took 1448ms
$ ./tc34567 500 2000
Testing sizes a 500 , b 375
mpz_mul took 820ms
mpz_tc3 took 860ms
toom3.5 took 660ms (*)
mpz_tc5 took 712ms
mpz_tc6 took 752ms
mpz_tc7 took 744ms
$ ./tc34567 2500 500
Testing sizes a 2500 , b 1875
mpz_mul took 2496ms
mpz_tc3 took 2280ms
toom3.5 took 1676ms (*)
mpz_tc5 took 1740ms
mpz_tc6 took 1740ms
mpz_tc7 took 1700ms
$ ./tc34567 3500 500
Testing sizes a 3500 , b 2625
mpz_mul took 3904ms
mpz_tc3 took 3720ms
mpz_3.5 took 2532ms (*)
mpz_tc5 took 2917ms
mpz_tc6 took 2716ms
mpz_tc7 took 2780ms
$ ./tc34567 4500 500
Testing sizes a 4500 , b 3375
mpz_mul took 4020ms
mpz_tc3 took 5760ms
toom3.5 took 4020ms
mpz_tc5 took 4156ms
mpz_tc6 took 3624ms (*)
mpz_tc7 took 3664ms
$ tc34567 && ./tc34567 5500 500
Testing sizes a 5500 , b 4125
mpz_mul took 5372ms
mpz_tc3 took 7596ms
toom3.5 took 5097ms
mpz_tc5 took 5552ms
mpz_tc6 took 5269ms
mpz_tc7 took 4880ms (*)
$ tc34567 && ./tc34567 6500 500
Testing sizes a 6500 , b 4875
mpz_mul took 6272ms
mpz_tc3 took 10033ms
toom3.5 took 6480ms
mpz_tc5 took 6793ms
mpz_tc6 took 6548ms
mpz_tc7 took 6037ms (*)

Notes:
- Toom3.5 will be faster than Toom4, for any size, and this is obvious...
- mpz_tc3 is sometimes (in the 1600-3600 range) faster than mpz_mul.
This may indicate that padding can give worst result than management of
different sizes.

Regards,
Marco

PS: after a good sleep... I changed the code again, now mpz_tc3_5 performs
all we called Toom3.5 in the paper. I mean, it implements both 4x3 and 5x2
computation, chosing at run-time. For the recursive multiply A(Inf)*B(Inf)
it recurse to itself, because the initial factors could be not exactly
5x2 or 4x3, so this product can be unbalanced again!

I made some very rough graph, for comparisons, and this single unbalanced
algorithm perform quite well if compared to balanced Toom nxn, for many
ratios (tested 4x3, 5x3 and 2x1).