# n-prime fft multiplication, for n = 1, 2, 3

Niels Möller nisse at lysator.liu.se
Sat Jun 5 20:32:49 CEST 2004

```I've now experimented with one-prime, two-prime and three-prime fft
multiplication.

Short recap of the theory:

Write a number z as

z_0 + z_1 2^l + z_2 2^{2l} + z_3 2^{3l} + ... + z_{n-1} 2^{(n-1)l}

Consider the corresponding polynomial

f(t) = z_0 + z_1 t + ... z_{n-1} t^{n-1}

so that z = f(2^l). f, and hence f(2^l), is uniquely determined by
the value of f in any n distinct points.

If z = x*y, then we can evaluate the polynomials corresponding to x
and y pointwise, multiply pointwise to get the value of f in the same
points, and then reconstruct f and z = f(2^k).

The coefficients x_j, y_j should be in the range [0, ..., 2^l-1], the
coefficients z_j of the product polynomial f will be larger, so we get
some carries when evaluating f(2^l). We have the bound z_j <= n (2^l -
1)^2.

To make things go fast, one should use n = 2^k, and choose the points
n points as 1, w, w^2, ..., w^{n-1}, where w must be chosen so that
w^n = 1 and the powers 1, w, w^2, ..., w^{n-1} really are distinct.
This is what's needed to make fft work.

If we choose a prime

p > n (2^l - 1)^2

then we can do all the computations mod p. And if in addition, p is of
the form

p = 1 + c 2^K, K >= k

then there exists a suitable n:th root w mod p. We can also use
several distinct primes of this form, as long as the product of the
primes is larger than n (2^l - 1)^2: multiply the polynomials mod each
of the primes using fft, and then use the chinese remainder theorem to
reconstruct z_j.

In the code below, I use three fixed 32-bit primes.

/* p0 = 3 * 2^30 + 1 = c0000001, g = 7d */
/* p1 = 13 * 2^28 + 1 = d0000001, g = 1853d3 */
/* p2 = 29 * 2^27 + 1 = e8000001, g = 414338b3 */

The function mul_fft1 uses a variable l in the range 6 <= l <= 15.
mul_fft2 uses a variable l in the range 16 <= l <= 31. mul_fft3 uses l
= 32 (one limb), it might be better to use a slightly larger l.

Another feature that might be interesting is that the code does not
require that the two inputs x and y are of the same size. Actually, an
n x m multiplication, with m < n, is generally faster than n x n.
There are two reasons for this:

* It's sufficient that the power of two used for the fft is
larger than the size of the product, n + m. Hence, it's (sometimes)
possible to use a smaller transform for n x m than for n x n.

* The bound p > n (2^l - 1)^2 is too strict, it's really

p > m (2^l - 1)^2

This means that for a smaller m, we may be able to use a larger l
with the same prime(s).

Ok, now we get to the bad news: It's pretty slow. The below table
shows the running time for n x n multiplication, comparing
karatsuba/toom3 to fft1, fft2 and fft3. For the three fft variants,
the running time is divided between

split:       Splitting the input into l-bit pieces.
mont:        Converting coefficients into montgomery form.
coeff:       Precomputing the fft coefficients used for fft and inverse.
trans:       Fourier transforms of the inputs.
mul:         Pointwise multiplication.
itrans:      Inverse fourier transform.
imont:       Recovering non-montgomery coefficients.
reconstruct: Evaluation of p(2^l), including any needed crt.

Numbers, for 0x400 - 0x4000 limbs.

n    kara/toom    fft1     fft2     fft3

400  0.664062 6.718750 6.406250 4.531250
split   mont  coeff  trans    mul itrans  imont reconstruct
2.3%   0.0%   2.3%  54.7%   1.2%  33.7%   0.0%   5.8% (fft1)
1.2%   1.2%   0.0%  56.8%   1.2%  32.1%   0.0%   7.4% (fft2)
0.0%   1.7%  53.4%   0.0%  31.0%   0.0%  13.8% (fft3)
600  1.250000 15.625000 6.718750 9.687500
split   mont  coeff  trans    mul itrans  imont reconstruct
2.0%   0.0%   2.0%  61.2%   0.0%  32.7%   0.0%   2.0% (fft1)
0.0%   2.3%   1.2%  53.5%   1.2%  27.9%   1.2%  12.8% (fft2)
1.6%   0.0%  59.0%   0.0%  31.1%   1.6%   6.6% (fft3)
800  1.875000 16.250000 14.062500 10.000000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.0%   0.0%   2.0%  60.0%   2.0%  32.0%   2.0%   2.0% (fft1)
0.0%   2.2%   1.1%  55.1%   1.1%  32.6%   1.1%   6.7% (fft2)
0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0% (fft3)
c00  3.437500 58.750000 15.000000 21.250000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.0%   1.1%   0.0%  64.0%   1.1%  32.6%   0.0%   1.1% (fft1)
0.0%   4.3%   4.3%  56.4%   0.0%  33.0%   2.1%   0.0% (fft2)
1.6%   0.0%  54.7%   0.0%  37.5%   0.0%   6.2% (fft3)
1000  5.312500 58.750000 33.750000 21.875000
split   mont  coeff  trans    mul itrans  imont reconstruct
1.1%   2.2%   1.1%  58.4%   2.2%  30.3%   1.1%   3.4% (fft1)
2.0%   0.0%   2.0%  56.9%   0.0%  29.4%   2.0%   7.8% (fft2)
5.9%   0.0%  50.0%   7.4%  26.5%   1.5%   8.8% (fft3)
1800  9.687500 170.000000 36.250000 50.000000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.0%   1.9%   0.0%  63.5%   0.0%  28.8%   0.0%   5.8% (fft1)
0.0%   3.8%   0.0%  54.7%   1.9%  28.3%   1.9%   9.4% (fft2)
0.0%   1.3%  58.7%   0.0%  28.0%   1.3%  10.7% (fft3)
2000  15.625000 175.000000 120.000000 52.500000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.0%   1.9%   1.9%  61.5%   1.9%  28.8%   1.9%   1.9% (fft1)
0.0%   0.0%   0.0%  64.3%   1.2%  32.1%   0.0%   2.4% (fft2)
1.3%   3.8%  53.2%   3.8%  27.8%   0.0%  10.1% (fft3)
3000  28.750000 490.000000 125.000000 180.000000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.0%   0.7%   0.7%  63.9%   0.7%  33.3%   0.7%   0.0% (fft1)
0.0%   0.0%   0.0%  60.5%   2.6%  28.9%   2.6%   5.3% (fft2)
3.7%   0.0%  57.4%   5.6%  27.8%   0.0%   5.6% (fft3)
4000  43.750000 1085.000000 345.000000 185.000000
split   mont  coeff  trans    mul itrans  imont reconstruct
0.3%   0.3%   0.3%  64.9%   0.6%  32.9%   0.3%   0.3% (fft1)
0.0%   1.0%   0.0%  62.5%   1.0%  31.7%   1.0%   2.9% (fft2)
0.0%   0.0%  62.5%   0.0%  32.1%   0.0%   5.4% (fft3)

Conclusions:

* One-prime fft is slowest by far. One should use more than one prime.

* Even at 0x4000 (16384) limbs, fft is four times slower than toom3.
Extrapolating, toom3 looks superior up to at least 0x40000 (262144)
limbs.

* Two-prime and three-prime run at about the same speed. Which
function is fastest depends on how the input size matches the
power-of-two transform sizes.

* The bulk of the running time, about 90%, is spent in the fft
transformations. Next comes reconstruction, but it's less than 10%.

So to make this fft multiplication competitive at limb sizes below
100000, the transform function, fft(), must be made an order of
magnitude faster.

Is that possible at all? The optimizations I can think of is a larger
fft base case (currently, fft with n=4 is inlined, fft with n >= 8
uses recursion), reordering memory accesses to make it more cache
friendly, and rewriting fft and/or redc in assembler. What gains are
possible?

I'm attaching my implementation, and the pike script I used to
generate the parameters in fft_constants.h. The code works only with
GMP_LIMB_BITS = GMP_NUMB_BITS = 32.

Happy hacking,
/Niels

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_fft.c
Type: text/x-c-code
Size: 29679 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/mul_fft-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fft_constants.h
Type: text/x-c-code
Size: 1591 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/fft_constants-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fft.pike
Type: text/x-pike-code
Size: 3870 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/fft-0001.bin
```