bader at newmail.ru
Thu May 14 01:42:33 CEST 2009
On Wed, May 13, 2009 at 8:39 PM, Torbjorn Granlund <tg at gmplib.org> wrote:
> Aleksey Bader <bader at newmail.ru> writes:
> 1. I used 3 prime numbers of the form l*2^k+1 that fit in a 32-bit word.
> That's what one needs for the small primes FFT, I think. Such numbers give
> the needed high-order roots of unity.
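For concreteness, here is a minimal sketch of what such moduli look like,
using three well-known 32-bit NTT primes and textbook generators as stand-ins
(these specific constants are an assumption for illustration, not necessarily
the ones actually used). It only checks that each p = l*2^k + 1 yields a
primitive 2^k-th root of unity w = g^l mod p:

#include <stdint.h>
#include <stdio.h>

static uint32_t powmod(uint32_t b, uint32_t e, uint32_t p)
{
    uint64_t r = 1, x = b % p;
    while (e) {
        if (e & 1)
            r = r * x % p;
        x = x * x % p;
        e >>= 1;
    }
    return (uint32_t) r;
}

int main(void)
{
    /* p = l*2^k + 1 with a textbook primitive root g (assumed values). */
    struct { uint32_t p, l, k, g; } m[3] = {
        {  469762049u,   7u, 26u,  3u },  /*   7*2^26 + 1 */
        {  998244353u, 119u, 23u,  3u },  /* 119*2^23 + 1 */
        { 2013265921u,  15u, 27u, 31u },  /*  15*2^27 + 1 */
    };
    for (int i = 0; i < 3; i++) {
        /* w = g^l has order exactly 2^k, so w^(2^(k-1)) must equal p-1. */
        uint32_t w    = powmod(m[i].g, m[i].l, m[i].p);
        uint32_t half = powmod(w, 1u << (m[i].k - 1), m[i].p);
        printf("p=%u  w=%u  w^(2^(k-1))=%u  (expect %u)\n",
               m[i].p, w, half, m[i].p - 1);
    }
    return 0;
}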
> 2. To improve memory locality I also used a Cooley-Tukey FFT with
> precomputed tables of root powers.
> In how many dimensions?
> Mapping out your data into a two-dimensional FFT is great as long as a row
> and a column fit well into the (L1) cache. But it degenerates for greater
> sizes.
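For reference, a rough sketch of the two-dimensional (four-step/Bailey) data
flow under discussion. fft_pow2() is a hypothetical natural-order, strided
sub-transform and is only declared, not implemented; the point is merely
where the column passes, twiddles and row passes sit:

#include <stdint.h>

/* assumed: in-place natural-order DFT of len elements spaced stride apart */
void fft_pow2(uint32_t *a, uint32_t len, uint32_t stride, uint32_t p);

static uint32_t mulmod(uint32_t x, uint32_t y, uint32_t p)
{
    return (uint32_t)((uint64_t) x * y % p);
}

static uint32_t powmod(uint32_t b, uint64_t e, uint32_t p)
{
    uint32_t r = 1;
    for (b %= p; e; e >>= 1, b = mulmod(b, b, p))
        if (e & 1)
            r = mulmod(r, b, p);
    return r;
}

/* a[] holds the input as an R x C row-major matrix; w is a primitive N-th
 * root of unity mod p, N = R*C, p < 2^31.  Afterwards a[r*C + c] holds DFT
 * bin c*R + r, i.e. the result is read out transposed. */
void fft_four_step(uint32_t *a, uint32_t R, uint32_t C, uint32_t w, uint32_t p)
{
    /* 1. Length-R transforms down the columns (stride C): poor locality if
     *    a column does not fit in cache, hence the L1 remark above. */
    for (uint32_t c = 0; c < C; c++)
        fft_pow2(a + c, R, C, p);

    /* 2. Twiddles: entry (r,c) is multiplied by w^(r*c); in practice these
     *    come from the precomputed tables of root powers. */
    for (uint32_t r = 1; r < R; r++)
        for (uint32_t c = 1; c < C; c++)
            a[r * C + c] = mulmod(a[r * C + c],
                                  powmod(w, (uint64_t) r * c, p), p);

    /* 3. Length-C transforms along the rows (stride 1, cache friendly). */
    for (uint32_t r = 0; r < R; r++)
        fft_pow2(a + r * C, C, 1, p);
}

With R and C chosen near sqrt(N), each pass touches rows or columns of about
sqrt(N) elements, which is why the scheme works well only as long as those
fit in cache.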
> 3. Burret reduction was used for fast modular multiplication (Tommy
> Montgomery and Victor Shoup's reductions).
> Do you mean "Barrett"? I.e., something along the lines of
> This is much slower than using REDC or Shoup's trick.
> IIRC, the butterfly operation time for an Opteron is around 10 cycles when
> using REDC. Shoup's trick might save one or two cycles. Using the trick
> from the paper above should get you closer to 20 cycles for a butterfly.
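For the record, Shoup's trick for multiplying by a fixed constant amounts to
something like the following sketch (assuming p < 2^31; this is the generic
textbook version, not code from GMP or from either implementation mentioned):

#include <stdint.h>

typedef struct {
    uint32_t w;     /* the fixed constant, 0 <= w < p            */
    uint32_t wpre;  /* floor(w * 2^32 / p), precomputed once     */
} shoup_t;

shoup_t shoup_prepare(uint32_t w, uint32_t p)
{
    shoup_t s = { w, (uint32_t)(((uint64_t) w << 32) / p) };
    return s;
}

/* w*x mod p with one high and two low multiplies, no division at run time.
 * w*x - q*p lies in [0, 2p), so one conditional subtraction finishes the
 * reduction (this needs p < 2^31). */
static uint32_t shoup_mulmod(uint32_t x, shoup_t s, uint32_t p)
{
    uint32_t q = (uint32_t)(((uint64_t) s.wpre * x) >> 32);
    uint32_t r = s.w * x - q * p;      /* wraps mod 2^32, which is fine */
    return r >= p ? r - p : r;
}

/* Radix-2 butterfly a' = a + w*b, b' = a - w*b (mod p) built on top;
 * in an NTT the shoup_t values would sit in the precomputed root tables. */
void butterfly(uint32_t *a, uint32_t *b, shoup_t w, uint32_t p)
{
    uint32_t t = shoup_mulmod(*b, w, p);
    uint32_t u = *a;
    *a = (u + t >= p) ? u + t - p : u + t;
    *b = (u >= t)     ? u - t     : u + p - t;
}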
> To be honest, I didn't use GMP with its assembly-optimized basic
> operations. So I wanted to try the combination of my tricks with those you
> have, to see if it could be useful.
> There are many ideas to try, and several of these appear in either Tommy's
> or Niels' implementations.
> 1. Vary the number of primes, say 2 to 5, and for a given product size n
> choose the number of primes so that the transform size becomes as small as
> possible.
> 2. Use a truncated FFT scheme, such as van der Hoeven's "TFT". This might
> make it less critical to vary the number of primes.
> 3. Use cache-friendly schemes (Cooley-Tukey, Bailey) with careful thought
> about how to perform transpositions, and perhaps with padding to avoid
> destructive strides when walking the matrix. Use up to 5-dimensional
> matrices on 64-bit machines.
> 4. Tune the butterfly operations in assembly. GMP's code is useless here.
> One could contemplate vectorizing techniques for some machines, as
> consecutive butterfly operations are completely independent.
> 5. Really understand how caches work, and take into account the line size
> when organizing the computation.
> 6. Tune polynomialisation and CRT, even if these operations are O(n). They
> take lots of time for medium-size operands. (See the CRT sketch after this
> list.)
> And more I have forgotten.
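To illustrate item 6 (and the recombination step of a three-prime scheme in
general), here is a hedged sketch of Garner-style CRT for one output
coefficient, using GMP's mpz layer for the final accumulation; the helper
names are made up for this example:

#include <stdint.h>
#include <gmp.h>

/* Modular inverse via Fermat's little theorem (p prime, a != 0 mod p). */
static uint32_t invmod_u32(uint32_t a, uint32_t p)
{
    uint64_t r = 1, x = a % p;
    for (uint32_t e = p - 2; e; e >>= 1) {
        if (e & 1)
            r = r * x % p;
        x = x * x % p;
    }
    return (uint32_t) r;
}

/* Garner recombination: rop gets the unique 0 <= x < p1*p2*p3 with
 * x = r1 (mod p1), x = r2 (mod p2), x = r3 (mod p3).  rop must already be
 * initialized with mpz_init(); primes are assumed distinct and < 2^31. */
void crt3(mpz_t rop, uint32_t r1, uint32_t r2, uint32_t r3,
          uint32_t p1, uint32_t p2, uint32_t p3)
{
    /* v2 = (r2 - r1) / p1 (mod p2) */
    uint32_t v2 = (uint32_t)((uint64_t)((r2 + p2 - r1 % p2) % p2)
                             * invmod_u32(p1 % p2, p2) % p2);
    /* v3 = (r3 - (r1 + p1*v2)) / (p1*p2) (mod p3) */
    uint32_t x3 = (uint32_t)((r1 % p3 + (uint64_t)(p1 % p3) * v2) % p3);
    uint32_t inv12 = invmod_u32((uint32_t)((uint64_t) p1 * p2 % p3), p3);
    uint32_t v3 = (uint32_t)((uint64_t)((r3 + p3 - x3) % p3) * inv12 % p3);
    /* x = r1 + p1*(v2 + p2*v3), accumulated with GMP */
    mpz_set_ui(rop, v3);
    mpz_mul_ui(rop, rop, p2);
    mpz_add_ui(rop, rop, v2);
    mpz_mul_ui(rop, rop, p1);
    mpz_add_ui(rop, rop, r1);
}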
Thanks a lot for your advice. I'll try these ideas as soon as I rewrite my
code using GMP structures.
I remembered an idea for decreasing the FFT threshold when the lengths of
the operands differ significantly. Assume we multiply two numbers u and v
of lengths n and m respectively, and fix n > m. In that case we usually
do 3*fft(n+m): 3 because we need the image of u, the image of v, and the
recovery of u*v via the inverse FFT.
If n/m is big enough, then 3*fft(n+m) could be replaced with
(2*n/m+1)*fft(2m) (I can't say how big n/m should be; maybe 3-4 works well).
It could be worthwhile if m << n, i.e. we can consider v as one "large digit"
that needs to be multiplied by n/m other digits. Since we multiply n/m
different digits by the same digit v, we can compute the Fourier image of v
only once.
Obviously, the FFT threshold for operands of similar size won't change, but
it could be lower in the unbalanced case described above.
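As a back-of-the-envelope check of the two counts above, assuming fft(L)
costs roughly L*log2(L) and ignoring pointwise products, CRT and constant
factors, one can tabulate both expressions as n/m grows (this only
illustrates the shape of the trade-off, not a real threshold):

#include <math.h>
#include <stdio.h>

/* crude model: one transform of length L costs about L*log2(L) */
static double fft_cost(double len) { return len * log2(len); }

int main(void)
{
    double m = 1024.0;                    /* length of the short operand v */
    for (double ratio = 1; ratio <= 16; ratio *= 2) {
        double n = ratio * m;             /* length of the long operand u  */
        double balanced   = 3.0 * fft_cost(n + m);
        /* u cut into n/m pieces of length m: n/m forward transforms for
         * the pieces, one for v, and n/m inverse transforms of products */
        double unbalanced = (2.0 * n / m + 1.0) * fft_cost(2.0 * m);
        printf("n/m = %4.0f:  3*fft(n+m) = %9.0f   (2n/m+1)*fft(2m) = %9.0f\n",
               ratio, balanced, unbalanced);
    }
    return 0;
}

At n/m = 1 the two expressions coincide, and the second one grows only
linearly in n thereafter, which matches the intuition of reusing the image
of v.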
Even if it does not seem practical, that approach might be useful
when a number/vector/matrix is multiplied by a vector/matrix.
Is this something you have already done? If so, is it profitable?
Also, I noticed that the FFT multiplication is planned to be completely
rewritten. Which method will be used: the new SS approach from Paul
Zimmermann, Pierrick Gaudry and Alexander Kruppa, or the approach from
Tommy Färnqvist and Niels Möller?
I was pleasantly surprised to get such detailed and friendly answers.
Good job. Thanks.