Nx1 division is implemented using repeated 2x1 divisions from high to low, either with a hardware divide instruction or a multiplication by inverse, whichever is best on a given CPU.
The multiply by inverse follows “Improved division by invariant integers” by
Möller and Granlund (see References) and is implemented as
udiv_qrnnd_preinv
in gmp-impl.h. The idea is to have a
fixed-point approximation to 1/d (see invert_limb
) and then
multiply by the high limb (plus one bit) of the dividend to get a quotient
q. With d normalized (high bit set), q is no more than 1
too small. Subtracting q*d from the dividend gives a remainder, and
reveals whether q or q-1 is correct.
The result is a division done with two multiplications and four or five arithmetic operations. On CPUs with low latency multipliers this can be much faster than a hardware divide, though the cost of calculating the inverse at the start may mean it’s only better on inputs bigger than say 4 or 5 limbs.
When a divisor must be normalized, either for the generic C
__udiv_qrnnd_c
or the multiply by inverse, the division performed is
actually a*2^k by d*2^k where a is the dividend and
k is the power necessary to have the high bit of d*2^k set.
The bit shifts for the dividend are usually accomplished “on the fly”
meaning by extracting the appropriate bits at each step. Done this way the
quotient limbs come out aligned ready to store. When only the remainder is
wanted, an alternative is to take the dividend limbs unshifted and calculate
r = a mod d*2^k followed by an extra final step r*2^k mod d*2^k. This can help on CPUs with poor bit shifts or
few registers.
The multiply by inverse can be done two limbs at a time. The calculation is basically the same, but the inverse is two limbs and the divisor treated as if padded with a low zero limb. This means more work, since the inverse will need a 2x2 multiply, but the four 1x1s to do that are independent and can therefore be done partly or wholly in parallel. Likewise for a 2x1 calculating q*d. The net effect is to process two limbs with roughly the same two multiplies worth of latency that one limb at a time gives. This extends to 3 or 4 limbs at a time, though the extra work to apply the inverse will almost certainly soon reach the limits of multiplier throughput.
A similar approach in reverse can be taken to process just half a limb at a time if the divisor is only a half limb. In this case the 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each limb, which can be a saving on CPUs with a fast half limb multiply, or in fact if the only multiply is a half limb, and especially if it’s not pipelined.