tg at gmplib.org
Wed May 13 18:39:39 CEST 2009
Aleksey Bader <bader at newmail.ru> writes:
1. I used 3 prime numbers of the form l*2^k+1 that fit 32-bit word.
That's what one needs for the small primes FFT, I think. Such numbers give
the needed high-order roots of unity.
2. To get memory locality I also used Cooley-Tukey FFT with precomputed
tables of root powers.
In how many dimensions?
Mapping out your data into a two-dimensional FFT is great as long as a row
and column fits well into the (L1) cache. But it degenerates for greater
3. Burret reduction was used for fast modular multiplication (Tommy compared
Montgomery and Victor Shoup 's reductions).
Do you mean "Barrett"? I.e., something along the lines of
This is much slower than using REDC or Shoup's trick.
IIRC, the butterly operation time for an Opteron is around 10 cycles when
using REDC. Shoup's trick might save one or two cycles. Using the trick of
the paper above should get you closer to 20 cycles for a butterly operation.
To be honest, I didn't use GMP with its assembly-optimized basic operations
So I wanted to try the combination of my tricks with those you have to check
if it could be useful.
There are many ideas to try, and several of these appear in either Tommy's
or Niels' implementations.
1. Vary the number of primes, say 2 to 5, and for a given product size n
choose number of primes so that the transform size becomes as small a
2. Use a truncated FFT scheme, such as van der Hoeven's "TFT". This might
make it less critical to vary the number of primes.
3. Use cache friendly schemes (Cooley-Tukey, Bailey) with careful thought of
how to perform transpositions, and perhaps with padding to avoid
destructive strides when walking the matrix. Use up to 5 dimensional
matrices on 64-bit machines.
4. Tune the butterly operations in assembly. GMP's code is useless here.
One could contemplate vectorizing techniques for some machines, as
consecutive butterly operations are completely independent.
5. Really understand how caches work, and take into account the line size
when organizing the computation.
6. Tune polynomiasation and CRT, even if these operations are O(n). They
take lots of time for medium size operands.
And more I have forgotten.
More information about the gmp-devel