3-prime FFT

Guillermo Ballester Valor gbv@oxixares.com
Sat, 14 Dec 2002 13:50:46 +0100

On Saturday 14 December 2002 12:14, Torbjorn Granlund wrote:
> Guillermo Ballester Valor <gbv@oxixares.com> writes:
>
>   I already wrote, many time ago, a Two-prime FFT (Proth
>   primes) which runned pretty fast (of course, not competing
>   with real-based FFT).
>
> These algorithms are more competitive on 64-bit machines
> that on 32-bit machines, compared to real-based FFT.
>
>   It is clear to me the need of a three-prime version for
>   32-bits integer based machine, for 64-bits integer ones
>   perhaps we only just need two primes in most cases.
>
> Why is there a difference?
>
I'm thinking that because for 64-bit machines and two primes, we can achi=
eve=20
big basecases with big FFT runlengh. In a 32-bit case it is not true.

> We need two large primes (64-bit or close to 64-bit)
> and one smaller prime.
>
> When multiplying two degree n polynomials, with their
> coefficients < 2^m, the coefficients of the product will be
> < n*(2^m)^2.  One could perhaps say that the third prime
> deals with the accumulation of terms, the n factor.
>

I see, you make the preliminary phase, cut the integer into chunks a lot =
more=20
easy (i.e. let 32 or 64 bits words as chunks). Then the CRT is more=20
complicated than in the two-prime case, and also don't forget we also nee=
d to=20
do three FFT-based mul instead of two although the small prime one were a=
lot=20
faster.=20

> But when we map our integer operands into polynomials, we
> may choose to take some other number than 32 or 64 bits per
> polynomial coefficient, in order to being able to use just
> two primes.  That might result in faster operation at least
> for not-so-huge operands.  Is that how your 2-prime code
> works?
>

Yes !. I don't know what of the two schemes would be better. I like your =
idea=20
too.=20

> An alternative for not-so-huge operands on 32-bit machines
> is to use two 32-bit primes and one 16-bit prime.
> Operations on that last prime will be faster.
>
Hum, for so small size (16 bits) it would be hard to find primes with the=
=20
propierties one usally wants (I'm thinkinig in triplet/pair of primes p s=
uch=20
as p-1 is divisible by k*2^j, k usually a small prime 3,5 or 7 for=20
non_power_of_two FFTS).=20

> For 64-bit machines, we should perhaps always use two 64-bit
> primes and one 32-bit prime, since we don't claim to support
> more than about 2^32 limbs in GMP.
>

I agree

>   I will look at your code (and mine too, I don't rememeber
>   most of my code details).
>
> Good!  The final code for this should probably:
>
> 1. Use Montgomery's REDC tricks
> 2. Use a recursive FFT version
> 3. Use a large basecase
> 4. Optionally use some assembly, perhaps simply for the base case
>
> My code is non-recursive and thus gets 100% cache miss for
> operands over a some size.  Performance drops to a fraction
> of the potential at that point.  Recursive code handles half
> the operand sizes for each recursion, and thus a large part
> of the work will be done on a small-enough data set for good
> caching.
>
> (My code is from 1991, when caches where not as important as
> they are today.)

Mine is newer, from 1999. ;-), and I'm still searching in my backups for =
the=20
code.=20

Guillermo.
--=20
Guillermo Ballester Valor
gbv@oxixares.com