15.1.6 FFT Multiplication

At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here.

The multiplication done is x*y mod 2^N+1, for a given N. A full product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x and y with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable.

The algorithm follows a split, evaluate, pointwise multiply, interpolate and combine similar to that described above for Karatsuba and Toom-3. A k parameter controls the split, with an FFT-k splitting into 2^k pieces of M=N/2^k bits each. N must be a multiple of (2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding bit shifts in the split and combine stages.

The evaluations, pointwise multiplications, and interpolation are all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of 2^k and of mp_bits_per_limb. The results of interpolation will be the following negacyclic convolution of the input pieces, and the choice of N' ensures these sums aren’t truncated.

           ---
           \         b
w[n] =     /     (-1) * x[i] * y[j]
           ---
       i+j==b*2^k+n
          b=0,1

The points used for the evaluation are g^i for i=0 to 2^k-1 where g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces necessary cancellations at the interpolation stage, and it’s also a power of 2 so the fast Fourier transforms used for the evaluation and interpolation do only shifts, adds and negations.

The pointwise multiplications are done modulo 2^N'+1 and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size N'. The interpolation is an inverse fast Fourier transform. The resulting set of sums of x[i]*y[j] are added at appropriate offsets to give the final result.

Squaring is the same, but x is the only input so it’s one transform at the evaluate stage and the pointwise multiplies are squares. The interpolation is the same.

For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the exponent representing 2^k recursed modular multiplies each 1/2^(k-1) the size of the original. Each successive k is an asymptotic improvement, but overheads mean each is only faster at bigger and bigger sizes. In the code, MUL_FFT_TABLE and SQR_FFT_TABLE are the thresholds where each k is used. Each new k effectively swaps some multiplying for some shifts, adds and overheads.

A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply plus a subtraction, so an FFT and Toom-3 etc can be compared directly. A k=4 FFT at O(N^1.333) can be expected to be the first faster than Toom-3 at O(N^1.465). In practice this is what’s found, with MUL_FFT_MODF_THRESHOLD and SQR_FFT_MODF_THRESHOLD being between 300 and 1000 limbs, depending on the CPU. So far it’s been found that only very large FFTs recurse into pointwise multiplies above these sizes.

When an FFT is to give a full product, the change of N to 2N doesn’t alter the theoretical complexity for a given k, but for the purposes of considering where an FFT might be first used it can be assumed that the FFT is recursing into a normal multiply and that on that basis it’s doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs, making it O(N^(k/(k-2))). This would mean k=7 at O(N^1.4) would be the first FFT faster than Toom-3. In practice MUL_FFT_THRESHOLD and SQR_FFT_THRESHOLD have been found to be in the k=8 range, somewhere between 3000 and 10000 limbs.

The way N is split into 2^k pieces and then 2M+k+3 is rounded up to a multiple of 2^k and mp_bits_per_limb means that when 2^k>=mp\_bits\_per\_limb the effective N is a multiple of 2^(2k-1) bits. The +k+3 means some values of N just under such a multiple will be rounded to the next. The complexity calculations above assume that a favourable size is used, meaning one which isn’t padded through rounding, and it’s also assumed that the extra +k+3 bits are negligible at typical FFT sizes.

The practical effect of the 2^(2k-1) constraint is to introduce a step-effect into measured speeds. For example k=8 will round N up to a multiple of 32768 bits, so for a 32-bit limb there’ll be 512 limb groups of sizes for which mpn_mul_n runs at the same speed. Or for k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In practice it’s been found each k is used at quite small multiples of its size constraint and so the step effect is quite noticeable in a time versus size graph.

The threshold determinations currently measure at the mid-points of size steps, but this is sub-optimal since at the start of a new step it can happen that it’s better to go back to the previous k for a while. Something more sophisticated for MUL_FFT_TABLE and SQR_FFT_TABLE will be needed.