NTT multiplication.

Thu May 14 01:42:33 CEST 2009

On Wed, May 13, 2009 at 8:39 PM, Torbjorn Granlund <tg at gmplib.org> wrote:

> Aleksey Bader <bader at newmail.ru> writes:
>
>   1. I used 3 prime numbers of the form l*2^k+1 that fit 32-bit word.
>
> That's what one needs for the small primes FFT, I think.  Such numbers give
> the needed high-order roots of unity.
>
>  2. To get memory locality I also used Cooley-Tukey FFT with precomputed
>  tables of root powers.
>
> In how many dimensions?
>
> Mapping out your data into a two-dimensional FFT is great as long as a row
> and column fits well into the (L1) cache.  But it degenerates for greater
> transforms.
>
>  3. Burret reduction was used for fast modular multiplication (Tommy
> compared
>  Montgomery and Victor Shoup 's reductions).
>
> Do you mean "Barrett"?  I.e., something along the lines of
> http://gmplib.org/~tege/divcnst-pldi94.pdf<http://gmplib.org/%7Etege/divcnst-pldi94.pdf>
> ?
>
> This is much slower than using REDC or Shoup's trick.
>
> IIRC, the butterly operation time for an Opteron is around 10 cycles when
> using REDC.  Shoup's trick might save one or two cycles.  Using the trick
> of
> the paper above should get you closer to 20 cycles for a butterly
> operation.
>
>  To be honest, I didn't use GMP with its assembly-optimized basic
> operations
>  :-(.
>  So I wanted to try the combination of my tricks with those you have to
> check
>  if it could be useful.
>
> There are many ideas to try, and several of these appear in either Tommy's
> or Niels' implementations.
>
> 1. Vary the number of primes, say 2 to 5, and for a given product size n
>   choose number of primes so that the transform size becomes as small a
>   possible.
>
> 2. Use a truncated FFT scheme, such as van der Hoeven's "TFT".  This might
>   make it less critical to vary the number of primes.
>
> 3. Use cache friendly schemes (Cooley-Tukey, Bailey) with careful thought
> of
>   how to perform transpositions, and perhaps with padding to avoid
>   destructive strides when walking the matrix.  Use up to 5 dimensional
>   matrices on 64-bit machines.
>
> 4. Tune the butterly operations in assembly.  GMP's code is useless here.
>   One could contemplate vectorizing techniques for some machines, as
>   consecutive butterly operations are completely independent.
>
> 5. Really understand how caches work, and take into account the line size
>   when organizing the computation.
>
> 6. Tune polynomiasation and CRT, even if these operations are O(n).  They
>   take lots of time for medium size operands.
>
> And more I have forgotten.
>
> --
> Torbjörn
>
>
> Torbjörn,

Thanks a lot for your advices. I'll try them as soon as I rewrite my code
with GMP structures.

I've remembered that I'd got an idea of decreasing fft threshold if length
of multipliers significantly differs. Assume we multiply 2 numbers u and v,
with length n и m correspondingly. Let's fix n>m. In that case we usually
make 3*fft(n+m). 3 - because we need to find u image, v image and recover
u*v via inverse fft.
If n/m is big enough then 3*fft(n+m) could be replaced with
(2*n/m+1)*fft(2m) (I can't say how big should be n/m, maybe 3-4 goes well).
It could be worth if m << n, i.e. we can consider v as "one large digit"
that needed to be multiplied by n/m other digits. So long as we multiply n/m
different digits by the same digit (v), then we can get Fourier image of v
only once.
Obviously, fft threshold for equal-order numbers won't change, but it could
be less in the case described above.
Even if it does not seems to be practical, that approach might be useful
when number/vector/matrix is multiplied by vector/matrix.
Is it something you have already done? If so, is it profitable?

Also I noticed that FFT multiplication is planned to be completely rewritten
(http://gmplib.org/#FUTURE).
Which method will be used? The new SS approach from Paul Zimmermann,
Pierrick Gaudry and Alexander Kruppa or the approach from Tommy Färnqvist
and Niels Möller?

Aleksey

PS [offtopic]
I was pleasantly surprised to get such kind of detailed and friendly
answers. Good job. Thanks.
[/offtopic]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20090514/e01e0847/attachment.html>