Fri, 13 Dec 2002 05:31:41 +0000
On Friday 13 Dec 2002 12:35 am, Torbjorn Granlund wrote:
> I'd like to evaluate the 3-prime FFT algorithm for GMP 4.2.
> I think that with current GMP implementation techniques, it
> could be competitive with the current mul_fft code.
> The idea with the 3-prime algorithm is to compute three
> separate results, each with coefficients mod a distinct
> limb-sized prime. At the end, CRT is used to reconstruct
> the coefficients.
> The inner loops could be made to run very fast on modern
> computers. A nice property that will allow high speed is
> the non-recurrency of the inner loops.
> I have a basic implementation, originally written back in
> the GMP 1.0 times. An updated version is included below.
> This implementation used a umul_ppmm and a udiv_qrnnd in the
> inner loop (see the function `fft'). This is silly. At the
> very least udiv_qrnnd_preinv should be used. But of course,
> the real trick is to use Montgomery multiplication!
Barrett division has the same asymtotic complexity as Montgomery=20
multiplication . I don't know how compairs speed wise for single limbs , =
it does have the advantage that no conversion between representations is=20
needed , this may or may not be an advantage in this case
> And then we should write the inner loop in assembly for
> important processors. Since the inner loop is simpler than
> any current GMP assembly loops, that isn't much work. I
> expect it to run at about the same speed as mpn_mul_1.
> (My code also uses a large precomputed table. It would be
> great to avoid that, and I think that could be done without
> much speed loss. I also think the inverse transform now
> computed with `ffi' could use `fft'.)
> Any takers? Kevin? Paul?