<div class="gmail_quote">On Wed, May 13, 2009 at 8:39 PM, Torbjorn Granlund <span dir="ltr"><<a href="mailto:tg@gmplib.org" target="_blank">tg@gmplib.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div>Aleksey Bader <<a href="mailto:bader@newmail.ru" target="_blank">bader@newmail.ru</a>> writes:<br>
<br>
</div><div> 1. I used 3 prime numbers of the form l*2^k+1 that fit 32-bit word.<br>
<br>
</div>That's what one needs for the small primes FFT, I think. Such numbers give<br>
the needed high-order roots of unity.<br>
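For concreteness, here is a minimal Python sketch of finding such 32-bit primes l*2^k + 1 and a matching high-order root of unity (the search strategy and the Miller-Rabin bases are my own illustrative choices, not anything from GMP):<br>

```python
# Sketch: find 32-bit primes of the form l*2^k + 1. Such primes admit
# primitive 2^k-th roots of unity mod p, which is what a small-primes
# FFT (NTT) needs. Illustrative only -- not GMP's actual prime selection.

def is_prime(n):
    """Deterministic Miller-Rabin for n < 3.3e24 with these bases."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def ntt_primes(k, count=3):
    """The `count` largest 32-bit primes p = l*2^k + 1."""
    found, l = [], (2**32 - 1) >> k
    while l > 0 and len(found) < count:
        p = l * 2**k + 1
        if is_prime(p):
            found.append(p)
        l -= 1
    return found

def root_of_unity(p, order):
    """A primitive `order`-th root of unity mod p (order = 2^k divides p-1)."""
    for g in range(2, p):
        w = pow(g, (p - 1) // order, p)
        if pow(w, order // 2, p) != 1:  # order of w is exactly `order`
            return w

primes = ntt_primes(25)          # supports transforms up to length 2^25
w = root_of_unity(primes[0], 2**25)
```

Three such primes let a product's coefficients be recovered by CRT as long as they stay below p1*p2*p3.<br>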
<div><br>
2. To get memory locality I also used Cooley-Tukey FFT with precomputed<br>
tables of root powers.<br>
<br>
</div>In how many dimensions?<br>
<br>
Mapping your data into a two-dimensional FFT is great as long as a row<br>
and a column fit well into the (L1) cache. But it degenerates for larger<br>
transforms.<br>
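For what it's worth, here is a small Python sketch of the two-dimensional (Bailey four-step) decomposition, checked against a naive DFT; the naive O(n^2) sub-transforms are only there to keep the example short, and none of this reflects any particular implementation:<br>

```python
import cmath

def dft(xs):
    """Naive O(n^2) reference DFT."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n)
                for j, x in enumerate(xs)) for k in range(n)]

def fft_four_step(a, r, c):
    """Length r*c DFT as column DFTs, twiddles, row DFTs (Bailey)."""
    n = r * c
    w = cmath.exp(-2j * cmath.pi / n)
    # step 1: length-r DFT down each of the c columns (input index j1*c + j2)
    cols = [dft([a[j1 * c + j2] for j1 in range(r)]) for j2 in range(c)]
    # step 2: twiddle factors w^(k1*j2)
    rows = [[cols[j2][k1] * w ** (k1 * j2) for j2 in range(c)]
            for k1 in range(r)]
    # step 3: length-c DFT along each of the r rows
    rows = [dft(row) for row in rows]
    # step 4: transpose out: output index k = k2*r + k1
    out = [0j] * n
    for k1 in range(r):
        for k2 in range(c):
            out[k2 * r + k1] = rows[k1][k2]
    return out
```

A real implementation replaces the naive sub-transforms with radix FFTs and spends most of its care on the transposes in steps 1 and 4.<br>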
<div><br>
3. Burret reduction was used for fast modular multiplication (Tommy compared<br>
Montgomery and Victor Shoup's reductions).<br>
<br>
</div>Do you mean "Barrett"? I.e., something along the lines of<br>
<a href="http://gmplib.org/%7Etege/divcnst-pldi94.pdf" target="_blank">http://gmplib.org/~tege/divcnst-pldi94.pdf</a>?<br>
<br>
This is much slower than using REDC or Shoup's trick.<br>
<br>
IIRC, the butterfly operation time for an Opteron is around 10 cycles when<br>
using REDC. Shoup's trick might save one or two cycles. Using the trick of<br>
the paper above should get you closer to 20 cycles for a butterfly operation.<br>
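For reference, a hedged Python sketch of Shoup's trick (assuming a 64-bit word and a word-size modulus; the names are mine): multiplying by a fixed constant w costs one high multiply, one low multiply and a conditional subtraction, which is why the butterfly gets so cheap:<br>

```python
# Sketch of Shoup's trick for x*w mod p with a *fixed* w (e.g. a root
# power in an FFT butterfly). Assumes p < 2^63 and x, w < p.
# Python ints stand in for 64x64 -> 128-bit machine multiplies.

W = 64                                     # machine word size in bits

def shoup_precompute(w, p):
    """Precompute floor(w * 2^W / p), done once per root power."""
    return (w << W) // p

def shoup_mulmod(x, w, w_pre, p):
    q = (x * w_pre) >> W                   # high word of x * w_pre
    r = (x * w - q * p) & ((1 << W) - 1)   # low word; r is in [0, 2p)
    return r - p if r >= p else r

def butterfly(a, b, w, w_pre, p):
    """One FFT butterfly: (a + w*b, a - w*b) mod p."""
    t = shoup_mulmod(b, w, w_pre, p)
    return (a + t) % p, (a - t) % p
```

Since w_pre depends only on w and p, it is precomputed alongside the root-power tables.<br>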
<div><br>
To be honest, I didn't use GMP with its assembly-optimized basic operations<br>
:-(.<br>
So I wanted to try the combination of my tricks with those you have to check<br>
if it could be useful.<br>
<br>
</div>There are many ideas to try, and several of these appear in either Tommy's<br>
or Niels' implementations.<br>
<br>
1. Vary the number of primes, say 2 to 5, and for a given product size n<br>
choose the number of primes so that the transform size becomes as small as<br>
possible.<br>
<br>
2. Use a truncated FFT scheme, such as van der Hoeven's "TFT". This might<br>
make it less critical to vary the number of primes.<br>
<br>
3. Use cache-friendly schemes (Cooley-Tukey, Bailey) with careful thought about<br>
how to perform transpositions, and perhaps with padding to avoid<br>
destructive strides when walking the matrix. Use up to 5-dimensional<br>
matrices on 64-bit machines.<br>
<br>
4. Tune the butterfly operations in assembly. GMP's code is useless here.<br>
One could contemplate vectorization techniques for some machines, as<br>
consecutive butterfly operations are completely independent.<br>
<br>
5. Really understand how caches work, and take into account the line size<br>
when organizing the computation.<br>
<br>
6. Tune polynomialisation and CRT, even if these operations are O(n). They<br>
take lots of time for medium-size operands.<br>
<br>
And more I have forgotten.<br>
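Regarding point 6, the CRT recombination for three small primes can be sketched with Garner's algorithm (a hedged illustration; the prime values in the test are arbitrary NTT primes, and real code works word by word rather than on big integers):<br>

```python
# Sketch of the CRT step: given a coefficient's residues r1, r2, r3
# modulo three NTT primes, reconstruct it modulo p1*p2*p3 (Garner).
# Requires Python 3.8+ for pow(x, -1, m).

def crt3(r1, r2, r3, p1, p2, p3):
    # Solve x = r1 + p1*t1 + p1*p2*t2, one prime at a time.
    t1 = (r2 - r1) * pow(p1, -1, p2) % p2
    x12 = r1 + p1 * t1                         # x mod p1*p2
    t2 = (r3 - x12) * pow(p1 * p2, -1, p3) % p3
    return x12 + p1 * p2 * t2                  # x mod p1*p2*p3
```

The inverses pow(p1, -1, p2) and pow(p1*p2, -1, p3) are constants of the prime set and would be precomputed.<br>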
<font color="#888888"><br>
--<br>
Torbjörn<br>
<br>
<br>
</font></blockquote></div><font style="color: rgb(0, 0, 0);" color="#888888">Torbjörn</font>,<br><br>Thanks a lot for your advice. I'll try these ideas as soon as I rewrite my code with GMP structures.<br><br>I
remembered an idea for decreasing the fft threshold when the lengths of
the multiplicands differ significantly. Assume we multiply two numbers u
and v, of lengths n and m respectively, with n > m. In that
case we usually perform 3*fft(n+m): 3 because we need the image of u, the
image of v, and an inverse fft to recover u*v.<br>
If n/m is big enough, then 3*fft(n+m) could be replaced with
(2*n/m+1)*fft(2m) (I can't say how big n/m should be; maybe 3-4 works
well). This pays off when m << n, i.e. we can consider v as
"one large digit" that needs to be multiplied by n/m other digits. As
long as we multiply n/m different digits by the same digit (v), we
need to compute the Fourier image of v only once.<br>
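As a sanity check of the arithmetic behind this, here is a small Python sketch of the chunking (plain integer multiplication stands in for the per-chunk FFT products; in the real scheme the forward transform of v would be computed once and reused for every chunk):<br>

```python
# Sketch: multiply u (n words) by a much shorter v (m words) by splitting
# u into ceil(n/m) chunks of m words and accumulating chunk*v with shifts.
# The FFT saving comes from reusing fft(v) across all chunks; here plain
# int multiplication stands in for fft / pointwise / inverse-fft per chunk.

def unbalanced_mul(u, v, m, word_bits=32):
    base = 1 << (m * word_bits)       # one chunk = m words
    acc, shift = 0, 0
    while u:
        chunk = u % base              # next m-word chunk of u
        u //= base
        acc += (chunk * v) << shift   # per chunk: one forward fft, one inverse
        shift += m * word_bits
    return acc
```

Counting transforms of size 2m: one for v, then one forward and one inverse per chunk, giving the 2*n/m+1 figure above.<br>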
Obviously, the fft threshold for numbers of equal size won't change, but it could be lower in the case described above.<br>Even
if this does not seem practical on its own, the approach might be useful
when a number/vector/matrix is multiplied by a vector/matrix.<br>
Is this something you have already tried? If so, was it profitable?<br><br>
Also I noticed that FFT multiplication is planned to be completely rewritten (<a href="http://gmplib.org/#FUTURE" target="_blank">http://gmplib.org/#FUTURE</a>).<br>
Which method will be used? The new SS approach from Paul Zimmermann,
Pierrick Gaudry and Alexander Kruppa or the approach from Tommy
Färnqvist and Niels Möller?<br><font color="#888888">
<br>Aleksey<br>
</font><br>
PS
[offtopic]<br>
I was <span style="color: rgb(0, 0, 0);">pleasantly surprised to get such </span>detailed and <span style="color: rgb(0, 0, 0);">friendly answers. </span>Good job. Thanks.<br>
[/offtopic]