GMP 4.3 multiplication performance

Tue Jun 2 15:12:34 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

  > While working on multiplication improvements for GMP 4.4, I realized
  > that we are doing something quite silly for unbalanced operands in GMP
  > 4.3.  For some machines, GMP 4.3 might actually be substantially slower
  > than GMP 4.2!
  >
  > The current multiply code works roughly like this:
  >
  >   if (operands have same size)
  >     use mpn_sqr_n or mpn_mul_n
  >
  >   if (schoolbook operand range)
  >     use mpn_mul_basecase
  >
  >   if (fft range)
  >     use mpn_mul_fft_full
  >
  >   else use toom42, toom32, or toom22, or some combination of these.
  >
  > The problem is that for any unbalanced operands between the schoolbook
  > range and the fft range, we will basically use karatsuba's algorithm,
  > not any toom variant.  This is pretty awful except for smallish
  > operands.

  I don't understand: you should fall down to the "else" part above, which
  tries toom42, toom32, or toom22.

I meant that "use" in the pseudocode is a committment, i.e., that one
does not try other alternatives later in the code.

  Moreover what does mean "schoolbook operand range" and "fft range" for
  unbalanced operands. Do you test the smaller size, the larger size, the
  average of both?

You put the finger exactly at the sore point.  :-)

The way we do that in GMP is typically to look at the smaller size (here
called vn).  It works reasonably well.

But I am working on various refinements, without resorting to
two-dimensional algorithm selection tables.

The first thing is to separate THRESHOLD values for balanced and
unbalanced multiplication; it turns out that the unbalanced cases
typically should stay with an algorithm longer (i.e., for larger
operands) than balanced cases.

But before doing more work on mpn_mul, I want more toomMN primitives, in
particular toom43, toom53 (which exists but is currently unused) and
perhaps toom63.  Then we might actually also want toom54, 64, 74, 84.
Yes, it is getting crazy.  :-)

  What you really need is a two-dimensional threshold scheme.

Yes, but that might be quite difficult to implement in practice:

  enum mul_algorithm choose[10000][100000];
  switch (choose[un][vn])
    {
      case SCHOOLBOOK:
        ...
      case KARATSUBA:
        ...
      case TOOM32:
        ...
      ...
      ...
    }

Or do you have something less memory-hungry in mind?  :-)

  Marco Bodrato did produce a corresponding graph some time ago.

Where?

-- 
Torbjörn