Toom-Cook 4x3 and 5x2 multiplication: Toom3.5

Marco Bodrato bodrato at
Thu Nov 23 09:03:55 CET 2006

Dear Paul,

On Wed, 22 Nov 2006 22:34:00 +0100, Paul Zimmermann wrote
> I have implemented at the mpn level the Toom-Cook 2.5 algorithm suggested by
> Marco in

Great! On my side, I implemented Toom-Cook 3.5 at the mpz level. You can find
the code and some graphs together with Toom5, Toom6, and Toom7 on my page .

> The code is included below. On a Pentium M with gmp-4.2.1, it it faster than
> mpn_mul up from n=11 (i.e. a 33 x 22 product), and up to the FFT threshold

On mine it is faster from n=9 (27x18)!
$ ./tc2.5 9 1000000
mpn_mul      took 3264ms
mpn_toom_3_2 took 3104ms (0.95)

> (i.e. 4992 x 3328). On this large range, it saves up to about 20% with respect
> to mpn_mul.

Toom3.5 is faster than mpz_mul from 120x90 product, and up to 4500x3375.
Saving up to 35%. Then, Toom7 is still 5% faster than FFT, up to 6500x4875.
This Toom3.5 implementation is that fast also because it recurses on Toom7
for balanced sub-multiplications.

Here are some timings:

$ ./tc34567 120 20000
Testing sizes a 120 , b 90
mpz_mul took 928ms
mpz_tc3 took 920ms
toom3.5 took 884ms (*)
mpz_tc5 took 1112ms
mpz_tc6 took 1284ms
mpz_tc7 took 1448ms
$ ./tc34567 500 2000
Testing sizes a 500 , b 375
mpz_mul took 820ms
mpz_tc3 took 860ms
toom3.5 took 660ms (*)
mpz_tc5 took 712ms
mpz_tc6 took 752ms
mpz_tc7 took 744ms
$ ./tc34567 2500 500
Testing sizes a 2500 , b 1875
mpz_mul took 2496ms
mpz_tc3 took 2280ms
toom3.5 took 1676ms (*)
mpz_tc5 took 1740ms
mpz_tc6 took 1740ms
mpz_tc7 took 1700ms
$ ./tc34567 3500 500
Testing sizes a 3500 , b 2625
mpz_mul took 3904ms
mpz_tc3 took 3720ms
mpz_3.5 took 2532ms (*)
mpz_tc5 took 2917ms
mpz_tc6 took 2716ms
mpz_tc7 took 2780ms
$ ./tc34567 4500 500
Testing sizes a 4500 , b 3375
mpz_mul took 4020ms
mpz_tc3 took 5760ms
toom3.5 took 4020ms
mpz_tc5 took 4156ms
mpz_tc6 took 3624ms (*)
mpz_tc7 took 3664ms
$ tc34567 && ./tc34567 5500 500
Testing sizes a 5500 , b 4125
mpz_mul took 5372ms
mpz_tc3 took 7596ms
toom3.5 took 5097ms
mpz_tc5 took 5552ms
mpz_tc6 took 5269ms
mpz_tc7 took 4880ms (*)
$ tc34567 && ./tc34567 6500 500
Testing sizes a 6500 , b 4875
mpz_mul took 6272ms
mpz_tc3 took 10033ms
toom3.5 took 6480ms
mpz_tc5 took 6793ms
mpz_tc6 took 6548ms
mpz_tc7 took 6037ms (*)

- Toom3.5 will be faster than Toom4, for any size, and this is obvious...
- mpz_tc3 is sometimes (in the 1600-3600 range) faster than mpz_mul.
This may indicate that padding can give worst result than management of
different sizes.


PS: after a good sleep... I changed the code again, now mpz_tc3_5 performs
all we called Toom3.5 in the paper. I mean, it implements both 4x3 and 5x2
computation, chosing at run-time. For the recursive multiply A(Inf)*B(Inf)
it recurse to itself, because the initial factors could be not exactly
5x2 or 4x3, so this product can be unbalanced again!

I made some very rough graph, for comparisons, and this single unbalanced
algorithm perform quite well if compared to balanced Toom nxn, for many
ratios (tested 4x3, 5x3 and 2x1).

More information about the gmp-devel mailing list