<div class="gmail_quote">On Wed, May 13, 2009 at 8:39 PM, Torbjorn Granlund <span dir="ltr"><<a href="mailto:tg@gmplib.org" target="_blank">tg@gmplib.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div>Aleksey Bader <<a href="mailto:bader@newmail.ru" target="_blank">bader@newmail.ru</a>> writes:<br>
<br>
</div><div> 1. I used 3 prime numbers of the form l*2^k+1 that fit 32-bit word.<br>
<br>
</div>That's what one needs for the small primes FFT, I think. Such numbers give<br>
the needed high-order roots of unity.<br>
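For concreteness, here is a minimal Python sketch of finding such 32-bit primes l*2^k + 1 and a matching high-order root of unity (the search strategy and the Miller-Rabin bases are my own illustrative choices, not anything from GMP):<br>

```python
# Sketch: find 32-bit primes of the form l*2^k + 1. Such primes admit
# primitive 2^k-th roots of unity mod p, which is what a small-primes
# FFT (NTT) needs. Illustrative only -- not GMP's actual prime selection.

def is_prime(n):
    """Deterministic Miller-Rabin for n < 3.3e24 with these bases."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def ntt_primes(k, count=3):
    """The `count` largest 32-bit primes p = l*2^k + 1."""
    found, l = [], (2**32 - 1) >> k
    while l > 0 and len(found) < count:
        p = l * 2**k + 1
        if is_prime(p):
            found.append(p)
        l -= 1
    return found

def root_of_unity(p, order):
    """A primitive `order`-th root of unity mod p (order = 2^k divides p-1)."""
    for g in range(2, p):
        w = pow(g, (p - 1) // order, p)
        if pow(w, order // 2, p) != 1:  # order of w is exactly `order`
            return w

primes = ntt_primes(25)          # supports transforms up to length 2^25
w = root_of_unity(primes[0], 2**25)
```

Three such primes let a product's coefficients be recovered by CRT as long as they stay below p1*p2*p3.<br>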
<div><br>
2. To get memory locality I also used Cooley-Tukey FFT with precomputed<br>
tables of root powers.<br>
<br>
</div>In how many dimensions?<br>
<br>
Mapping your data into a two-dimensional FFT is great as long as a row<br>
and a column fit well into the (L1) cache. But it degenerates for larger<br>
transforms.<br>
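For what it's worth, here is a small Python sketch of the two-dimensional (Bailey four-step) decomposition, checked against a naive DFT; the naive O(n^2) sub-transforms are only there to keep the example short, and none of this reflects any particular implementation:<br>

```python
import cmath

def dft(xs):
    """Naive O(n^2) reference DFT."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n)
                for j, x in enumerate(xs)) for k in range(n)]

def fft_four_step(a, r, c):
    """Length r*c DFT as column DFTs, twiddles, row DFTs (Bailey)."""
    n = r * c
    w = cmath.exp(-2j * cmath.pi / n)
    # step 1: length-r DFT down each of the c columns (input index j1*c + j2)
    cols = [dft([a[j1 * c + j2] for j1 in range(r)]) for j2 in range(c)]
    # step 2: twiddle factors w^(k1*j2)
    rows = [[cols[j2][k1] * w ** (k1 * j2) for j2 in range(c)]
            for k1 in range(r)]
    # step 3: length-c DFT along each of the r rows
    rows = [dft(row) for row in rows]
    # step 4: transpose out: output index k = k2*r + k1
    out = [0j] * n
    for k1 in range(r):
        for k2 in range(c):
            out[k2 * r + k1] = rows[k1][k2]
    return out
```

A real implementation replaces the naive sub-transforms with radix FFTs and spends most of its care on the transposes in steps 1 and 4.<br>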
<div><br>
3. Burret reduction was used for fast modular multiplication (Tommy compared<br>
Montgomery and Victor Shoup's reductions).<br>
<br>
</div>Do you mean "Barrett"? I.e., something along the lines of<br>
<a href="http://gmplib.org/%7Etege/divcnst-pldi94.pdf" target="_blank">http://gmplib.org/~tege/divcnst-pldi94.pdf</a>?<br>
<br>
This is much slower than using REDC or Shoup's trick.<br>
<br>
IIRC, the butterfly operation time for an Opteron is around 10 cycles when<br>
using REDC. Shoup's trick might save one or two cycles. Using the trick of<br>
the paper above should get you closer to 20 cycles for a butterfly operation.<br>
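For reference, a hedged Python sketch of Shoup's trick (assuming a 64-bit word and a word-size modulus; the names are mine): multiplying by a fixed constant w costs one high multiply, one low multiply and a conditional subtraction, which is why the butterfly gets so cheap:<br>

```python
# Sketch of Shoup's trick for x*w mod p with a *fixed* w (e.g. a root
# power in an FFT butterfly). Assumes p < 2^63 and x, w < p.
# Python ints stand in for 64x64 -> 128-bit machine multiplies.

W = 64                                     # machine word size in bits

def shoup_precompute(w, p):
    """Precompute floor(w * 2^W / p), done once per root power."""
    return (w << W) // p

def shoup_mulmod(x, w, w_pre, p):
    q = (x * w_pre) >> W                   # high word of x * w_pre
    r = (x * w - q * p) & ((1 << W) - 1)   # low word; r is in [0, 2p)
    return r - p if r >= p else r

def butterfly(a, b, w, w_pre, p):
    """One FFT butterfly: (a + w*b, a - w*b) mod p."""
    t = shoup_mulmod(b, w, w_pre, p)
    return (a + t) % p, (a - t) % p
```

Since w_pre depends only on w and p, it is precomputed alongside the root-power tables.<br>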
<div><br>
To be honest, I didn't use GMP with its assembly-optimized basic operations<br>
:-(.<br>
So I wanted to try the combination of my tricks with those you have to check<br>
if it could be useful.<br>
<br>
</div>There are many ideas to try, and several of these appear in either Tommy's<br>
or Niels' implementations.<br>
<br>
1. Vary the number of primes, say 2 to 5, and for a given product size n<br>
choose the number of primes so that the transform size becomes as small as<br>
possible.<br>
<br>
2. Use a truncated FFT scheme, such as van der Hoeven's "TFT". This might<br>
make it less critical to vary the number of primes.<br>
<br>
3. Use cache-friendly schemes (Cooley-Tukey, Bailey) with careful thought about<br>
how to perform transpositions, and perhaps with padding to avoid<br>
destructive strides when walking the matrix. Use up to 5-dimensional<br>
matrices on 64-bit machines.<br>
<br>
4. Tune the butterfly operations in assembly. GMP's code is useless here.<br>
One could contemplate vectorization techniques for some machines, as<br>
consecutive butterfly operations are completely independent.<br>
<br>
5. Really understand how caches work, and take into account the line size<br>
when organizing the computation.<br>
<br>
6. Tune polynomialisation and CRT, even if these operations are O(n). They<br>
take lots of time for medium-size operands.<br>
<br>
And more I have forgotten.<br>
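Regarding point 6, the CRT recombination for three small primes can be sketched with Garner's algorithm (a hedged illustration; the prime values in the test are arbitrary NTT primes, and real code works word by word rather than on big integers):<br>

```python
# Sketch of the CRT step: given a coefficient's residues r1, r2, r3
# modulo three NTT primes, reconstruct it modulo p1*p2*p3 (Garner).
# Requires Python 3.8+ for pow(x, -1, m).

def crt3(r1, r2, r3, p1, p2, p3):
    # Solve x = r1 + p1*t1 + p1*p2*t2, one prime at a time.
    t1 = (r2 - r1) * pow(p1, -1, p2) % p2
    x12 = r1 + p1 * t1                         # x mod p1*p2
    t2 = (r3 - x12) * pow(p1 * p2, -1, p3) % p3
    return x12 + p1 * p2 * t2                  # x mod p1*p2*p3
```

The inverses pow(p1, -1, p2) and pow(p1*p2, -1, p3) are constants of the prime set and would be precomputed.<br>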
<font color="#888888"><br>
--<br>
Torbjörn<br>
<br>
<br>
</font></blockquote></div><font style="color: rgb(0, 0, 0);" color="#888888">Torbjörn</font>,<br><br>Thanks a lot for your advice. I'll try these ideas as soon as I rewrite my code with GMP structures.<br><br>I
remembered an idea for decreasing the fft threshold when the lengths of
the multiplicands differ significantly. Assume we multiply two numbers u
and v, of lengths n and m respectively, with n > m. In that
case we usually perform 3*fft(n+m): 3 because we need the image of u, the
image of v, and an inverse fft to recover u*v.<br>
If n/m is big enough, then 3*fft(n+m) could be replaced with
(2*n/m+1)*fft(2m) (I can't say how big n/m should be; maybe 3-4 works
well). This pays off when m << n, i.e. we can consider v as
"one large digit" that needs to be multiplied by n/m other digits. As
long as we multiply n/m different digits by the same digit (v), we
need to compute the Fourier image of v only once.<br>
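As a sanity check of the arithmetic behind this, here is a small Python sketch of the chunking (plain integer multiplication stands in for the per-chunk FFT products; in the real scheme the forward transform of v would be computed once and reused for every chunk):<br>

```python
# Sketch: multiply u (n words) by a much shorter v (m words) by splitting
# u into ceil(n/m) chunks of m words and accumulating chunk*v with shifts.
# The FFT saving comes from reusing fft(v) across all chunks; here plain
# int multiplication stands in for fft / pointwise / inverse-fft per chunk.

def unbalanced_mul(u, v, m, word_bits=32):
    base = 1 << (m * word_bits)       # one chunk = m words
    acc, shift = 0, 0
    while u:
        chunk = u % base              # next m-word chunk of u
        u //= base
        acc += (chunk * v) << shift   # per chunk: one forward fft, one inverse
        shift += m * word_bits
    return acc
```

Counting transforms of size 2m: one for v, then one forward and one inverse per chunk, giving the 2*n/m+1 figure above.<br>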
Obviously, the fft threshold for numbers of equal size won't change, but it could be lower in the case described above.<br>Even
if this does not seem practical on its own, the approach might be useful
when a number/vector/matrix is multiplied by a vector/matrix.<br>
Is this something you have already tried? If so, was it profitable?<br><br>
Also I noticed that FFT multiplication is planned to be completely rewritten (<a href="http://gmplib.org/#FUTURE" target="_blank">http://gmplib.org/#FUTURE</a>).<br>
Which method will be used? The new SS approach from Paul Zimmermann,
Pierrick Gaudry and Alexander Kruppa or the approach from Tommy
Färnqvist and Niels Möller?<br><font color="#888888">
<br>Aleksey<br>
</font><br>
PS
[offtopic]<br>
I was <span style="color: rgb(0, 0, 0);">pleasantly surprised to get such </span>detailed and <span style="color: rgb(0, 0, 0);">friendly answers. </span>Good job. Thanks.<br>
[/offtopic]