NTT multiplication.

Wed May 13 18:39:39 CEST 2009

Aleksey Bader <bader at newmail.ru> writes:

  1. I used 3 prime numbers of the form l*2^k+1 that fit 32-bit word.

That's what one needs for the small primes FFT, I think.  Such numbers give
the needed high-order roots of unity.

  2. To get memory locality I also used Cooley-Tukey FFT with precomputed
  tables of root powers.

In how many dimensions?

Mapping out your data into a two-dimensional FFT is great as long as a row
and column fits well into the (L1) cache.  But it degenerates for greater
transforms.

  3. Burret reduction was used for fast modular multiplication (Tommy compared
  Montgomery and Victor Shoup 's reductions).

Do you mean "Barrett"?  I.e., something along the lines of
http://gmplib.org/~tege/divcnst-pldi94.pdf?

This is much slower than using REDC or Shoup's trick.

IIRC, the butterly operation time for an Opteron is around 10 cycles when
using REDC.  Shoup's trick might save one or two cycles.  Using the trick of
the paper above should get you closer to 20 cycles for a butterly operation.

  To be honest, I didn't use GMP with its assembly-optimized basic operations
  :-(.
  So I wanted to try the combination of my tricks with those you have to check
  if it could be useful.

There are many ideas to try, and several of these appear in either Tommy's
or Niels' implementations.

1. Vary the number of primes, say 2 to 5, and for a given product size n
   choose number of primes so that the transform size becomes as small a
   possible.

2. Use a truncated FFT scheme, such as van der Hoeven's "TFT".  This might
   make it less critical to vary the number of primes.

3. Use cache friendly schemes (Cooley-Tukey, Bailey) with careful thought of
   how to perform transpositions, and perhaps with padding to avoid
   destructive strides when walking the matrix.  Use up to 5 dimensional
   matrices on 64-bit machines.

4. Tune the butterly operations in assembly.  GMP's code is useless here.
   One could contemplate vectorizing techniques for some machines, as
   consecutive butterly operations are completely independent.

5. Really understand how caches work, and take into account the line size
   when organizing the computation.

6. Tune polynomiasation and CRT, even if these operations are O(n).  They
   take lots of time for medium size operands.

And more I have forgotten.

-- 
Torbjörn