Iterative Toom-Cook-2.5 (Toom32) method for unbalanced multiplication

Tue Dec 22 18:48:29 CET 2009

Dear Alberto,

That very much for this contribution to the GNU project!

Last year, Niels Möller implemented toom22 (aka karatsuba) using
separate "transforms" of one or both operands.  IIRC, he saw savings of
around 10% per pre-transformed operand.

His goal was speeding up computations with explicit invariance, such as
matrix multiplication with bignum matrix elements.

Your goal is a bit different; you want to optimise the implicit
invariance of unbalanced multiplication.

Explicit and implicit invariance is the next big thing I hope to pursue
in GMP, so your contribution comes very timely!

  2) Divide-by-2 technique, moving shifts from interpolation phase to
     evaluation phase, when possible. For Toom-2.5 this means that if
     evaluation of b in 1 (and therefore in -1, too) is even, it is
     possible to do one single shift in b evaluation phase, and no
     shifts at all in all interpolation phases during iteration.

You could perhaps always shift, and save any lost bit separately?

Why did you chose toom32 as a primitive for very unbalanced operations.
Which would be more efficient, tom32 or, say, toom42?

One cost related to very unbalanced multiplication is the assembly of
the toom-generated pieces.  This could be avoided by making the
interpolation code understand the situation; it could add instead of
writ the low n product limbs.  Do you do that?

  The code contains two version of the functions. The first one (1)
  considers only parity of b evaluation, the second (2) of b and of a's
  sections. In my experiments, (1) seems to be a bit faster than (2) on
  my architecture, but I report both versions. Maybe testing on other
  architectures could give different results. Compile with

I don't understand "of b and of a's sections".

  I made some tests comparing the code with gmp-4.3.1, the development
  version (December 2nd) and with a small modification suggested by
  Marco. The maximum obtained gains lie in my case between 26 and 33 %.
  I hope the code can someway be useful. Obviously corrections,
  comments, changes, optimisations and whatever else are all welcome. Thank you 
  very much in advance.

Nice speedups!

I hope you will not be disappointed if this code does not get integrated
until after the upcoming release?

-- 
Torbjörn