The Karatsuba multiplication algorithm is described in Knuth section 4.3.3 part A, and various other textbooks. A brief description is given here.
The inputs x and y are treated as each split into two parts of equal length (or the most significant part one limb shorter if N is odd).
high low +----------+----------+ | x1 | x0 | +----------+----------+ +----------+----------+ | y1 | y0 | +----------+----------+
Let b be the power of 2 where the split occurs, i.e. if x0 is k limbs (y0 the same) then b=2^(k*mp_bits_per_limb). With that x=x1*b+x0 and y=y1*b+y0, and the following holds,
x*y = (b^2+b)*x1*y1 - b*(x1-x0)*(y1-y0) + (b+1)*x0*y0
This formula means doing only three multiplies of (N/2)x(N/2) limbs, whereas a basecase multiply of NxN limbs is equivalent to four multiplies of (N/2)x(N/2). The factors (b^2+b) etc represent the positions where the three products must be added.
high low +--------+--------+ +--------+--------+ | x1*y1 | | x0*y0 | +--------+--------+ +--------+--------+ +--------+--------+ add | x1*y1 | +--------+--------+ +--------+--------+ add | x0*y0 | +--------+--------+ +--------+--------+ sub | (x1-x0)*(y1-y0) | +--------+--------+
The term (x1-x0)*(y1-y0) is best calculated as an absolute value, and the sign used to choose to add or subtract. Notice the sum high(x0*y0)+low(x1*y1) occurs twice, so it’s possible to do 5*k limb additions, rather than 6*k, but in GMP extra function call overheads outweigh the saving.
Squaring is similar to multiplying, but with x=y the formula reduces to an equivalent with three squares,
x^2 = (b^2+b)*x1^2 - b*(x1-x0)^2 + (b+1)*x0^2
The final result is accumulated from those three squares the same way as for the three multiplies above. The middle term (x1-x0)^2 is now always positive.
A similar formula for both multiplying and squaring can be constructed with a middle term (x1+x0)*(y1+y0). But those sums can exceed k limbs, leading to more carry handling and additions than the form above.
Karatsuba multiplication is asymptotically an O(N^1.585) algorithm,
the exponent being log(3)/log(2), representing 3 multiplies
each 1/2 the size of the inputs. This is a big improvement over the
basecase multiply at O(N^2) and the advantage soon overcomes the extra
additions Karatsuba performs. MUL_TOOM22_THRESHOLD
can be as little
as 10 limbs. The SQR
threshold is usually about twice the MUL
.
The basecase algorithm will take a time of the form M(N) = a*N^2 + b*N + c and the Karatsuba algorithm K(N) = 3*M(N/2) + d*N + e, which expands to K(N) = 3/4*a*N^2 + 3/2*b*N + 3*c + d*N + e. The
factor 3/4 for a means per-crossproduct speedups in the
basecase code will increase the threshold since they benefit M(N) more
than K(N). And conversely the 3/2 for b means
linear style speedups of b will increase the threshold since they
benefit K(N) more than M(N). The latter can be seen for
instance when adding an optimized mpn_sqr_diagonal
to
mpn_sqr_basecase
. Of course all speedups reduce total time, and in
that sense the algorithm thresholds are merely of academic interest.