Toom-8 testing (mpz level)

Wed Oct 7 13:36:30 CEST 2009

Alberto Zanoni <zanoni at volterra.uniroma2.it> writes:

  Dear developers,
                              I implemented a mpz version of Toom-8 algorithm 
  for multiplication. On my architecture I obtained someway interesting 
  results. You can find two graphics, for multiplication and squaring, at

  http://bodrato.it/papers/zanoni.html#CIVV2009b

I think you want to say "Generally speaking, only the degree 1
(Karatsuba), 2 and 3 ...", if you're speaking of polynomial degree
(which you seem to be doing since toom8 is said to use degree 7).

Why is the code slower in the 7000-9000 limb interval?  I suspect it
might be due to improper MUL_*_THRESHOLD values for the recursive
products.  Did you "make tune" on your machine and recompile with the
generated results?

  (under "Toom-8 way for long integers multiplication" - thanks Marco !)

  As I'm waiting for the publication of the proceedings of the SYNASC 2009 
  conference, where the related paper was presented, currently I cannot put the 
  paper itself, with all the details, on the web, but actually Marco is working 
  quite ahead of it, and will surely produce something better at mpn level, 
  with many new ideas, so I think that the paper is actually already obsolete.

I suppose that's a common problem.  I hope you can put an up-to-date
version on the web at some point?

  Just to point out that Toom-Cook-k methods with k as high as 8 (may be even 
  more ?) can be effective and worth implementing.

While I very much appreciate your work, I am not sure I am happy about
its outcome: many toom functions will be needed for optimal performance.
:-)

Lots of unbalanced toomMN also ask to be implemented:

Using 15 evaluation points: toom97, toomA6, toomB5, toomC4, toomD3.

We currently have the complete set of toom function for 7 evaluation
points and less (toom22, 32, 33, 42, 43, 52, 44, 53, 62) but 43, 52, 62,
and 53 are not actually used.

On my most-wanted list is 8-point evaluation (at the mpn level).  I
suspect the evaluation point ±1, ±1/2, ±2, 0, inf would be best, but I
suppose you understand that better than me.  8-point would give toom54,
63, 72, but 72 is probably not too useful.  I am very curious how toom63
would compare to toom42, given that 6/3 = 4/2.

My ongoing measurements of multiplication primitives (please see the
pictures at http://gmplib.org/devel/) indicate that toom62 is not
important.  Applying toom32 or toom42 partially is as fast or faster in
toom62's applicable interval.  If we implement separate evaluation, we
could handle these very unbalanced operands more efficiently by
evaluating the shorter operand just once, then reuse it for multiple
toom32/42 invocations.  This should give a nice speedup.

Have people though about unbalanced FFT operands?  FFT works naturally
for arbitrarily unbalanced operands, but it gets less and less
efficient; the run time is a function f of the product, where f is
C*n*log(n)*log(log(n)) and n is the output size.

I suspect (say) toom42 calling fft will be faster for operands whose
sizes are the the right relation.  Of course, one would want to
transform the smaller operand just once, like above for partial toom.

-- 
Torbjörn