Toom-8 testing (mpz level)

Wed Oct 7 15:06:27 CEST 2009

Alberto Zanoni <zanoni at volterra.uniroma2.it> writes:

  Yes, I did it. It's slower because FFT enters the game at 7168. The graphic 
  shows that , at least in my case, it is faster than Toom-8 in 7500-9000 limb 
  interval (for squares, the threshold for FFT is 3840).

OK, but big jumps indicate incorrect tuning.  But for FFT, there is of
course steps, which might be the problem here.  These steps are quite
visible in my diagrams too!

  You're perfectly right: quite a lot of effort still waiting.

And the complexity of GMP's multiplication implementation troubles me.
Don't get me wrong, I am much more obsessed with performance than
elegance.  :-)

We should organise the code to limit the size and complexity.  Niels
started to break out the evaluation into separate functions, see
mpn/generic/toom_eval_dgr3_pm1.c in the GMP repo.

But we shouldn't do this too much; toom22 and toom32 should probably
share no evaluation code since an extra call level would cost too much
for their small operands.

  I've seen the pictures and discussed then with Marco yesterday. It's really 
  difficult to detect an a priori way to separate regions for the different 
  algorithms.

I think the updated graphs might be better.  I've refined the measure-
ment code, and updated the underlying GMP.  There are now also diagrams
for 3 different machine types, so that one may see patterns better.

A main problem now is that the recursive multiplication for the point
at infinity uses mpn_mul (and I think Marco uses mpn_mul for a very
slightly unbalanced multiplication for another point) and mpn_mul does
not make use of sophisticated algorithm selection techniques.

  > Applying toom32 or toom42 partially is as fast or faster in 
  > toom62's applicable interval.  If we implement separate evaluation, we
  > could handle these very unbalanced operands more efficiently by
  > evaluating the shorter operand just once, then reuse it for multiple
  > toom32/42 invocations.  This should give a nice speedup.

  I think so as well.

I should have mentioned that Niels tried this for Karatsuba a while
back.  He saved things to the bottom, i.e., the combined evaluation
until the last iteration before schoolbook.  This is akin to a
transform, as in FFT.

If we start thinking of Toom as transform - dyadic product - inverse
transform, maybe we'll discover some hybrid Toom/FFT multiplication
algorithms...

-- 
Torbjörn