3-prime FFT

Torbjorn Granlund tege@swox.com
14 Dec 2002 12:14:22 +0100


Guillermo Ballester Valor <gbv@oxixares.com> writes:

  I already wrote, many time ago, a Two-prime FFT (Proth
  primes) which runned pretty fast (of course, not competing
  with real-based FFT).

These algorithms are more competitive on 64-bit machines
that on 32-bit machines, compared to real-based FFT.

  It is clear to me the need of a three-prime version for
  32-bits integer based machine, for 64-bits integer ones
  perhaps we only just need two primes in most cases.

Why is there a difference?

We need two large primes (64-bit or close to 64-bit)
and one smaller prime.

When multiplying two degree n polynomials, with their
coefficients < 2^m, the coefficients of the product will be
< n*(2^m)^2.  One could perhaps say that the third prime
deals with the accumulation of terms, the n factor.

But when we map our integer operands into polynomials, we
may choose to take some other number than 32 or 64 bits per
polynomial coefficient, in order to being able to use just
two primes.  That might result in faster operation at least
for not-so-huge operands.  Is that how your 2-prime code
works?

An alternative for not-so-huge operands on 32-bit machines
is to use two 32-bit primes and one 16-bit prime.
Operations on that last prime will be faster.

For 64-bit machines, we should perhaps always use two 64-bit
primes and one 32-bit prime, since we don't claim to support
more than about 2^32 limbs in GMP.

  I will look at your code (and mine too, I don't rememeber
  most of my code details).

Good!  The final code for this should probably:

1. Use Montgomery's REDC tricks
2. Use a recursive FFT version
3. Use a large basecase
4. Optionally use some assembly, perhaps simply for the base case

My code is non-recursive and thus gets 100% cache miss for
operands over a some size.  Performance drops to a fraction
of the potential at that point.  Recursive code handles half
the operand sizes for each recursion, and thus a large part
of the work will be done on a small-enough data set for good
caching.

(My code is from 1991, when caches where not as important as
they are today.)

--
Torbjörn