3-prime FFT

Jason Moxham J.L.Moxham@maths.soton.ac.uk
Fri, 13 Dec 2002 05:31:41 +0000


On Friday 13 Dec 2002 12:35 am, Torbjorn Granlund wrote:
> I'd like to evaluate the 3-prime FFT algorithm for GMP 4.2.
> I think that with current GMP implementation techniques, it
> could be competitive with the current mul_fft code.
>
> The idea with the 3-prime algorithm is to compute three
> separate results, each with coefficients mod a distinct
> limb-sized prime.  At the end, CRT is used to reconstruct
> the coefficients.
>
> The inner loops could be made to run very fast on modern
> computers.  A nice property that will allow high speed is
> the non-recurrency of the inner loops.
>
> I have a basic implementation, originally written back in
> the GMP 1.0 times.  An updated version is included below.
>
> This implementation used a umul_ppmm and a udiv_qrnnd in the
> inner loop (see the function `fft').  This is silly.  At the
> very least udiv_qrnnd_preinv should be used.  But of course,
> the real trick is to use Montgomery multiplication!
>

Barrett division has the same asymtotic complexity as Montgomery=20
multiplication . I don't know how compairs speed wise for single limbs , =
but=20
it does have the advantage that no conversion between representations is=20
needed , this may or may not be an advantage in this case


> And then we should write the inner loop in assembly for
> important processors.  Since the inner loop is simpler than
> any current GMP assembly loops, that isn't much work.  I
> expect it to run at about the same speed as mpn_mul_1.
>
> (My code also uses a large precomputed table.  It would be
> great to avoid that, and I think that could be done without
> much speed loss.  I also think the inverse transform now
> computed with `ffi' could use `fft'.)
>
> Any takers?  Kevin?  Paul?