At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here.
The multiplication done is x*y mod 2^N+1, for a given N. A full product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x and y with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable.
The algorithm follows a split, evaluate, pointwise multiply, interpolate and
combine similar to that described above for Karatsuba and Toom-3. A k
parameter controls the split, with an FFT-k splitting into 2^k
pieces of M=N/2^k bits each. N must be a multiple of
the split falls on limb boundaries, avoiding bit shifts in the split and
The evaluations, pointwise multiplications, and interpolation, are all done
modulo 2^N'+1 where N' is 2M+k+3 rounded up to a
multiple of 2^k and of
mp_bits_per_limb. The results of
interpolation will be the following negacyclic convolution of the input
pieces, and the choice of N' ensures these sums aren't truncated.
--- \ b w[n] = / (-1) * x[i] * y[j] --- i+j==b*2^k+n b=0,1
The points used for the evaluation are g^i for i=0 to 2^k-1 where g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces necessary cancellations at the interpolation stage, and it's also a power of 2 so the fast Fourier transforms used for the evaluation and interpolation do only shifts, adds and negations.
The pointwise multiplications are done modulo 2^N'+1 and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size N'. The interpolation is an inverse fast Fourier transform. The resulting set of sums of x[i]*y[j] are added at appropriate offsets to give the final result.
Squaring is the same, but x is the only input so it's one transform at the evaluate stage and the pointwise multiplies are squares. The interpolation is the same.
For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the exponent representing 2^k recursed
modular multiplies each 1/2^(k-1) the size of the original.
Each successive k is an asymptotic improvement, but overheads mean each
is only faster at bigger and bigger sizes. In the code,
SQR_FFT_TABLE are the thresholds where each k is used. Each
new k effectively swaps some multiplying for some shifts, adds and
A mod 2^N+1 product can be formed with a normal
NxN->2N bit multiply plus a subtraction, so an FFT
and Toom-3 etc can be compared directly. A k=4 FFT at
O(N^1.333) can be expected to be the first faster than Toom-3 at
O(N^1.465). In practice this is what's found, with
SQR_FFT_MODF_THRESHOLD being between
300 and 1000 limbs, depending on the CPU. So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.
When an FFT is to give a full product, the change of N to 2N
doesn't alter the theoretical complexity for a given k, but for the
purposes of considering where an FFT might be first used it can be assumed
that the FFT is recursing into a normal multiply and that on that basis it's
doing 2^k recursed multiplies each 1/2^(k-2) the size of
the inputs, making it O(N^(k/(k-2))). This would mean
k=7 at O(N^1.4) would be the first FFT faster than Toom-3.
SQR_FFT_THRESHOLD have been
found to be in the k=8 range, somewhere between 3000 and 10000 limbs.
The way N is split into 2^k pieces and then 2M+k+3 is
rounded up to a multiple of 2^k and
mp_bits_per_limb means that
mp_bits_per_limb the effective N is a
multiple of 2^(2k-1) bits. The +k+3 means some values of
N just under such a multiple will be rounded to the next. The
complexity calculations above assume that a favourable size is used, meaning
one which isn't padded through rounding, and it's also assumed that the extra
+k+3 bits are negligible at typical FFT sizes.
The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds. For example k=8 will round N
up to a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb
groups of sizes for which
mpn_mul_n runs at the same speed. Or for
k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In
practice it's been found each k is used at quite small multiples of its
size constraint and so the step effect is quite noticeable in a time versus
The threshold determinations currently measure at the mid-points of size
steps, but this is sub-optimal since at the start of a new step it can happen
that it's better to go back to the previous k for a while. Something
more sophisticated for
SQR_FFT_TABLE will be