GMP 4.3 multiplication performance

Sun May 31 19:35:17 CEST 2009

While working on multiplication improvements for GMP 4.4, I realized
that we are doing something quite silly for unbalanced operands in GMP
4.3.  For some machines, GMP 4.3 might actually be substantially slower
than GMP 4.2!

The current multiply code works roughly like this:

  if (operands have same size)
    use mpn_sqr_n or mpn_mul_n

  if (schoolbook operand range)
    use mpn_mul_basecase

  if (fft range)
    use mpn_mul_fft_full

  else use toom42, toom32, or toom22, or some combination of these.

The problem is that for any unbalanced operands between the schoolbook
range and the fft range, we will basically use karatsuba's algorithm,
not any toom variant.  This is pretty awful except for smallish
operands.

The use of unbalanced toom/karatsuba was introduced with GMP 4.3, and is
a great improvement for many operands.  Unfortunately, it is done too
naively.

Some measurements demonstrating the effects:

         mpn_mul.7900  mpn_mul.7901
7900     #11635165.00   20355963.00
         mpn_mul.8000  mpn_mul.8001
8000     #10745592.00   10750716.00

I.e., a 7900-limb operand times a 7900-lim operand takes 11635165 cycles
on an athlon 64.  But a I.e., a 7901-limb operand times a 7900-lim
operand takes 20355963 cycles.

In the FFT range, as for 8000 limbs, things are back to normal.  The FFT
code handles arbitrarily unbalanced operands (although its performance
degrades when the operands are very unbalanced).

(Note that a balanced 7900-limb product is also slower than a 8000-limb
balanced product.  That's another flaw.)

-- 
Torbjörn