# 3-prime FFT

**Jason Moxham
**
J.L.Moxham@maths.soton.ac.uk

*Fri, 13 Dec 2002 05:31:41 +0000*

On Friday 13 Dec 2002 12:35 am, Torbjorn Granlund wrote:
>* I'd like to evaluate the 3-prime FFT algorithm for GMP 4.2.
*>* I think that with current GMP implementation techniques, it
*>* could be competitive with the current mul_fft code.
*>*
*>* The idea with the 3-prime algorithm is to compute three
*>* separate results, each with coefficients mod a distinct
*>* limb-sized prime. At the end, CRT is used to reconstruct
*>* the coefficients.
*>*
*>* The inner loops could be made to run very fast on modern
*>* computers. A nice property that will allow high speed is
*>* the non-recurrency of the inner loops.
*>*
*>* I have a basic implementation, originally written back in
*>* the GMP 1.0 times. An updated version is included below.
*>*
*>* This implementation used a umul_ppmm and a udiv_qrnnd in the
*>* inner loop (see the function `fft'). This is silly. At the
*>* very least udiv_qrnnd_preinv should be used. But of course,
*>* the real trick is to use Montgomery multiplication!
*>*
*
Barrett division has the same asymtotic complexity as Montgomery=20
multiplication . I don't know how compairs speed wise for single limbs , =
but=20
it does have the advantage that no conversion between representations is=20
needed , this may or may not be an advantage in this case
>* And then we should write the inner loop in assembly for
*>* important processors. Since the inner loop is simpler than
*>* any current GMP assembly loops, that isn't much work. I
*>* expect it to run at about the same speed as mpn_mul_1.
*>*
*>* (My code also uses a large precomputed table. It would be
*>* great to avoid that, and I think that could be done without
*>* much speed loss. I also think the inverse transform now
*>* computed with `ffi' could use `fft'.)
*>*
*>* Any takers? Kevin? Paul?
*