Timing Toom-8.5 (mpn level)

Mon Oct 26 12:48:17 CET 2009

Niels,

>> In short terms (on my old laptop):
>> - GMP: Toom-8.5 is faster than Toom-4 from ~300 limbs ... and keeps on
>> being faster than FFT up to 12,000+ limbs
>
> Cool that it starts winning so early. I wonder what the thresholds
> would be if one also had the intermediate toom5, 6 and 7... It would
> push up the threshold for toom8 and toom8.5 quite a bit, I guess?

I do not know...

I _have_ Toom-6 (the truncated version of Toom-6.5), but its range is so
narrow that I decided to ignore it... Look at the zoomed graph! In the
range [250:380] there are _some_ sizes where T6 is faster than both T4 and
T8, it somehow depends on the congruence Mod 6, Mod 4, Mod 8... but, does
it worth distinguishing?
In MPIR there are Toom-4 and Toom-7, but my Toom-8 gets faster than both
_before_ they get faster than Toom-3!

The old mpz implementation I tested for Toom-4,5,6,7... suggested that
high degree Toom does not have a huge threshold... that's why I'm
skipping.

I have to complete my current goal... but the next can be Toom-10.5 or
higher, I doubt I'll never try intermediate ones...

One question arises:
someone (Torbjorn?) suggested that on some architectures addmul_n can be
slower than add + addlsh.
Are there macros tuned for this purpose? How far can this happen?
Should one consider trading addmul_n(6) with addlsh(1) + addlsh(2)?
Or addmul_n(7) with add+addlsh(1) + addlsh(2)? Or ...

In Toom-8.5 I added things like the following:
  /* FIXME: tuneup should decide the best variant */
#if TUNABLE_MACRO_NOT_NAMED_YET_MAYBE_AORSMUL_FASTER_THAN_AORS_2AORSLSH
  ASSERT_NOCARRY(mpn_submul_1 (r3, r1, n3p1, 3969));
#else
  ASSERT_NOCARRY(mpn_sub_n (r3, r3, r1, n3p1));
  ASSERT_NOCARRY(DO_mpn_addlsh_n (r3, r1, n3p1, 7, wsi));
  ASSERT_NOCARRY(DO_mpn_sublsh_n (r3, r1, n3p1, 12, wsi));
#endif

I tested both branches. If we name the macros... the code will be ready :-D

Regards,
Marco

-- 
http://bodrato.it/software/