Timing Toom-8.5 (mpn level)

Mon Oct 26 14:17:30 CET 2009

bodrato at mail.dm.unipi.it writes:

> I _have_ Toom-6 (the truncated version of Toom-6.5), but its range is so
> narrow that I decided to ignore it... Look at the zoomed graph! In the
> range [250:380] there are _some_ sizes where T6 is faster than both T4 and
> T8, it somehow depends on the congruence Mod 6, Mod 4, Mod 8... but is
> it worth distinguishing?

I see, it looks quite narrow.

If toom4, toom5, toom6 and toom7 each get a range of less than 100
limbs or so (and with only a small difference in slope at the crossover
points), maybe they're not worth the effort and code size.

> someone (Torbjorn?) suggested that on some architectures addmul_n can be
> slower than add + addlsh.

If the difference is large, one might consider addmul_by3 and the
like... though I guess that can be faster only where addmul_1 is limited
by multiplier throughput.
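
To make the two options concrete, here is a minimal sketch of r += 3*u
done both ways, in portable C rather than GMP's assembly. Everything in
it is an assumption for illustration: the 32-bit limb_t, and the names
addmul_by3_mul, sketch_add_n, sketch_addlsh1_n and addmul_by3_shift are
local stand-ins, not GMP functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb, so a uint64_t can hold limb * 3 + limb + carry. */
typedef uint32_t limb_t;

/* r[0..n-1] += 3 * u[0..n-1] with one multiply per limb (addmul_1 style).
   Returns the carry-out limb (at most 3). */
static limb_t addmul_by3_mul(limb_t *r, const limb_t *u, size_t n)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)u[i] * 3 + r[i] + carry;
        r[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* r += u. Returns the carry-out (0 or 1). */
static limb_t sketch_add_n(limb_t *r, const limb_t *u, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)r[i] + u[i] + carry;
        r[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* r += 2 * u, shifting u left by one bit on the fly.
   Returns the carry-out, including the bit shifted out of the top limb. */
static limb_t sketch_addlsh1_n(limb_t *r, const limb_t *u, size_t n)
{
    uint64_t carry = 0;
    limb_t shifted_out = 0;  /* top bit of the previous limb of u */
    for (size_t i = 0; i < n; i++) {
        limb_t twice = (limb_t)(u[i] << 1) | shifted_out;
        shifted_out = u[i] >> 31;
        uint64_t t = (uint64_t)r[i] + twice + carry;
        r[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)(carry + shifted_out);
}

/* r += 3 * u without any multiply: an add pass followed by an addlsh pass. */
static limb_t addmul_by3_shift(limb_t *r, const limb_t *u, size_t n)
{
    assert(n > 0);
    limb_t c = sketch_add_n(r, u, n);
    return c + sketch_addlsh1_n(r, u, n);
}
```

Both variants leave the same limbs in r and return the same carry. The
multiply-free one makes two passes over the operands, so whether it wins
depends on the machine.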

> I tested both branches. If we name the macros... the code will be ready :-D

We'd also need code in tune/tuneup to select the best strategy.
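
As a rough idea of what such selection code does, the sketch below
times each candidate and keeps the fastest. It is not the tune/tuneup
code: candidate_fn, time_candidate, pick_fastest and count_call are
hypothetical names, and clock() is a much cruder timer than a real
tuner would use.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for one strategy under test. */
typedef void (*candidate_fn)(void *ctx);

/* Seconds taken by `reps` calls of f(ctx), measured with clock(). */
static double time_candidate(candidate_fn f, void *ctx, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        f(ctx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Index of the smallest timing; ties go to the earlier candidate. */
static size_t pick_fastest(const double *seconds, size_t count)
{
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}

/* Example candidate: bumps a counter so the calls are observable. */
static void count_call(void *ctx)
{
    ++*(unsigned long *)ctx;
}
```

The index returned by pick_fastest would then decide which macro the
build defines for that machine.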

Regards,
/Niels