n-prime fft multiplication, for n = 1, 2, 3
Niels Möller
nisse at lysator.liu.se
Sat Jun 5 20:32:49 CEST 2004
I've now experimented with one-prime, two-prime and three-prime fft
multiplication.
Short recap of the theory:
Write a number z as
z_0 + z_1 2^l + z_2 2^{2l} + z_3 2^{3l} + ... + z_{n-1} 2^{(n-1)l}
Consider the corresponding polynomial
f(t) = z_0 + z_1 t + ... z_{n-1} t^{n-1}
so that z = f(2^l). f, and hence f(2^l), is uniquely determined by
the value of f in any n distinct points.
If z = x*y, then we can evaluate the polynomials corresponding to x
and y pointwise, multiply pointwise to get the value of f in the same
points, and then reconstruct f and z = f(2^k).
The coefficients x_j, y_j should be in the range [0, ..., 2^l-1], the
coefficients z_j of the product polynomial f will be larger, so we get
some carries when evaluating f(2^l). We have the bound z_j <= n (2^l -
1)^2.
To make things go fast, one should use n = 2^k, and choose the points
n points as 1, w, w^2, ..., w^{n-1}, where w must be chosen so that
w^n = 1 and the powers 1, w, w^2, ..., w^{n-1} really are distinct.
This is what's needed to make fft work.
If we choose a prime
p > n (2^l - 1)^2
then we can do all the computations mod p. And if in addition, p is of
the form
p = 1 + c 2^K, K >= k
then there exists a suitable n:th root w mod p. We can also use
several distinct primes of this form, as long as the product of the
primes is larger than n (2^l - 1)^2: multiply the polynomials mod each
of the primes using fft, and then use the chinese remainder theorem to
reconstruct z_j.
In the code below, I use three fixed 32-bit primes.
/* p0 = 3 * 2^30 + 1 = c0000001, g = 7d */
/* p1 = 13 * 2^28 + 1 = d0000001, g = 1853d3 */
/* p2 = 29 * 2^27 + 1 = e8000001, g = 414338b3 */
The function mul_fft1 uses a variable l in the range 6 <= l <= 15.
mul_fft2 uses a variable l in the range 16 <= l <= 31. mul_fft3 uses l
= 32 (one limb), it might be better to use a slightly larger l.
Another feature that might be interesting is that the code does not
require that the two inputs x and y are of the same size. Actually, an
n x m multiplication, with m < n, is generally faster than n x n.
There are two reasons for this:
* It's sufficient that the power of two used for the fft is
larger than the size of the product, n + m. Hence, it's (sometimes)
possible to use a smaller transform for n x m than for n x n.
* The bound p > n (2^l - 1)^2 is too strict, it's really
p > m (2^l - 1)^2
This means that for a smaller m, we may be able to use a larger l
with the same prime(s).
Ok, now we get to the bad news: It's pretty slow. The below table
shows the running time for n x n multiplication, comparing
karatsuba/toom3 to fft1, fft2 and fft3. For the three fft variants,
the running time is divided between
split: Splitting the input into l-bit pieces.
mont: Converting coefficients into montgomery form.
coeff: Precomputing the fft coefficients used for fft and inverse.
trans: Fourier transforms of the inputs.
mul: Pointwise multiplication.
itrans: Inverse fourier transform.
imont: Recovering non-montgomery coefficients.
reconstruct: Evaluation of p(2^l), including any needed crt.
Numbers, for 0x400 - 0x4000 limbs.
n kara/toom fft1 fft2 fft3
400 0.664062 6.718750 6.406250 4.531250
split mont coeff trans mul itrans imont reconstruct
2.3% 0.0% 2.3% 54.7% 1.2% 33.7% 0.0% 5.8% (fft1)
1.2% 1.2% 0.0% 56.8% 1.2% 32.1% 0.0% 7.4% (fft2)
0.0% 1.7% 53.4% 0.0% 31.0% 0.0% 13.8% (fft3)
600 1.250000 15.625000 6.718750 9.687500
split mont coeff trans mul itrans imont reconstruct
2.0% 0.0% 2.0% 61.2% 0.0% 32.7% 0.0% 2.0% (fft1)
0.0% 2.3% 1.2% 53.5% 1.2% 27.9% 1.2% 12.8% (fft2)
1.6% 0.0% 59.0% 0.0% 31.1% 1.6% 6.6% (fft3)
800 1.875000 16.250000 14.062500 10.000000
split mont coeff trans mul itrans imont reconstruct
0.0% 0.0% 2.0% 60.0% 2.0% 32.0% 2.0% 2.0% (fft1)
0.0% 2.2% 1.1% 55.1% 1.1% 32.6% 1.1% 6.7% (fft2)
0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0% (fft3)
c00 3.437500 58.750000 15.000000 21.250000
split mont coeff trans mul itrans imont reconstruct
0.0% 1.1% 0.0% 64.0% 1.1% 32.6% 0.0% 1.1% (fft1)
0.0% 4.3% 4.3% 56.4% 0.0% 33.0% 2.1% 0.0% (fft2)
1.6% 0.0% 54.7% 0.0% 37.5% 0.0% 6.2% (fft3)
1000 5.312500 58.750000 33.750000 21.875000
split mont coeff trans mul itrans imont reconstruct
1.1% 2.2% 1.1% 58.4% 2.2% 30.3% 1.1% 3.4% (fft1)
2.0% 0.0% 2.0% 56.9% 0.0% 29.4% 2.0% 7.8% (fft2)
5.9% 0.0% 50.0% 7.4% 26.5% 1.5% 8.8% (fft3)
1800 9.687500 170.000000 36.250000 50.000000
split mont coeff trans mul itrans imont reconstruct
0.0% 1.9% 0.0% 63.5% 0.0% 28.8% 0.0% 5.8% (fft1)
0.0% 3.8% 0.0% 54.7% 1.9% 28.3% 1.9% 9.4% (fft2)
0.0% 1.3% 58.7% 0.0% 28.0% 1.3% 10.7% (fft3)
2000 15.625000 175.000000 120.000000 52.500000
split mont coeff trans mul itrans imont reconstruct
0.0% 1.9% 1.9% 61.5% 1.9% 28.8% 1.9% 1.9% (fft1)
0.0% 0.0% 0.0% 64.3% 1.2% 32.1% 0.0% 2.4% (fft2)
1.3% 3.8% 53.2% 3.8% 27.8% 0.0% 10.1% (fft3)
3000 28.750000 490.000000 125.000000 180.000000
split mont coeff trans mul itrans imont reconstruct
0.0% 0.7% 0.7% 63.9% 0.7% 33.3% 0.7% 0.0% (fft1)
0.0% 0.0% 0.0% 60.5% 2.6% 28.9% 2.6% 5.3% (fft2)
3.7% 0.0% 57.4% 5.6% 27.8% 0.0% 5.6% (fft3)
4000 43.750000 1085.000000 345.000000 185.000000
split mont coeff trans mul itrans imont reconstruct
0.3% 0.3% 0.3% 64.9% 0.6% 32.9% 0.3% 0.3% (fft1)
0.0% 1.0% 0.0% 62.5% 1.0% 31.7% 1.0% 2.9% (fft2)
0.0% 0.0% 62.5% 0.0% 32.1% 0.0% 5.4% (fft3)
Conclusions:
* One-prime fft is slowest by far. One should use more than one prime.
* Even at 0x4000 (16384) limbs, fft is four times slower than toom3.
Extrapolating, toom3 looks superior up to at least 0x40000 (262144)
limbs.
* Two-prime and three-prime run at about the same speed. Which
function is fastest depends on how the input size matches the
power-of-two transform sizes.
* The bulk of the running time, about 90%, is spent in the fft
transformations. Next comes reconstruction, but it's less than 10%.
So to make this fft multiplication competitive at limb sizes below
100000, the transform function, fft(), must be made an order of
magnitude faster.
Is that possible at all? The optimizations I can think of is a larger
fft base case (currently, fft with n=4 is inlined, fft with n >= 8
uses recursion), reordering memory accesses to make it more cache
friendly, and rewriting fft and/or redc in assembler. What gains are
possible?
I'm attaching my implementation, and the pike script I used to
generate the parameters in fft_constants.h. The code works only with
GMP_LIMB_BITS = GMP_NUMB_BITS = 32.
Happy hacking,
/Niels
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mul_fft.c
Type: text/x-c-code
Size: 29679 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/mul_fft-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fft_constants.h
Type: text/x-c-code
Size: 1591 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/fft_constants-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fft.pike
Type: text/x-pike-code
Size: 3870 bytes
Desc: not available
Url : /list-archives/gmp-discuss/attachments/20040605/55e99bd4/fft-0001.bin
More information about the gmp-discuss
mailing list