GMP 4.3 multiplication performance

Tue Jun 2 17:23:07 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

  this might indicate that the comparison might perhaps involve un+vn, not only
  the smaller size vn.

Possibly.  But it seems that toom algorithm choice based on vn alone is
better than having it depend on un+vn alone.

  we might imagine that the tuning program might decompose the search space
  in say 1000 regions at most, each one delimited with straight lines, and
  then a few well-chosen comparison should be enough to find the right region.
  Or you might decompose vn/un in say 10 slices, and have a unidimensional
  threshold mechanism for each slice.

Yes, this might be a good idea.  I played a little with splitting the
un,vn area into 10x5 rectangle, just just the other day to see what it
results in.  For each such area, one would compare all primitives
available, and chose the overall best one (for some norm "best").

But I'd really would like to avoid such an approach for several reasons.

1. I think it is unelegant.
2. It will still use quite a large table, even if we make bigger subareas.
3. It will take a lot of time in tuneup to fill out the table...

Some additional thoughts:

Algorithms choice in GMP does not need to be superfast except for for
the smallest operands.  For larger operands, we will spend tens of
thousands of cycles on the actual computation, so if we spend 100 cycles
on choosing te best computation strategy, it might be an overall
improvement.

For multiplication, this means that we need to identify schoolbook
operands quickly.  We should also handle Karatsuba operands without
spending too much time.

Some numbers: We need around 2000 cycles for the 30-limb operand i on
athlon64, which is where Karatsuba kicks in.  We need around 17000
cycles for the smallest toom33 operands.

-- 
Torbjörn