Next: Basecase Division, Previous: Division Algorithms, Up: Division Algorithms [Index]

Nx1 division is implemented using repeated 2x1 divisions from high to low, either with a hardware divide instruction or a multiplication by inverse, whichever is best on a given CPU.

The multiply by inverse follows “Improved division by invariant integers” by
Möller and Granlund (see References) and is implemented as
`udiv_qrnnd_preinv`

in `gmp-impl.h`. The idea is to have a
fixed-point approximation to *1/d* (see `invert_limb`

) and then
multiply by the high limb (plus one bit) of the dividend to get a quotient
*q*. With *d* normalized (high bit set), *q* is no more than 1
too small. Subtracting *q*d* from the dividend gives a remainder, and
reveals whether *q* or *q-1* is correct.

The result is a division done with two multiplications and four or five arithmetic operations. On CPUs with low latency multipliers this can be much faster than a hardware divide, though the cost of calculating the inverse at the start may mean it’s only better on inputs bigger than say 4 or 5 limbs.

When a divisor must be normalized, either for the generic C
`__udiv_qrnnd_c`

or the multiply by inverse, the division performed is
actually *a*2^k* by *d*2^k* where *a* is the dividend and
*k* is the power necessary to have the high bit of *d*2^k* set.
The bit shifts for the dividend are usually accomplished “on the fly”
meaning by extracting the appropriate bits at each step. Done this way the
quotient limbs come out aligned ready to store. When only the remainder is
wanted, an alternative is to take the dividend limbs unshifted and calculate
*r = a mod d*2^k* followed by an extra final step *r*2^k mod d*2^k*. This can help on CPUs with poor bit shifts or
few registers.

The multiply by inverse can be done two limbs at a time. The calculation is
basically the same, but the inverse is two limbs and the divisor treated as if
padded with a low zero limb. This means more work, since the inverse will
need a 2x2 multiply, but the four 1x1s to do that are
independent and can therefore be done partly or wholly in parallel. Likewise
for a 2x1 calculating *q*d*. The net effect is to process two
limbs with roughly the same two multiplies worth of latency that one limb at a
time gives. This extends to 3 or 4 limbs at a time, though the extra work to
apply the inverse will almost certainly soon reach the limits of multiplier
throughput.

A similar approach in reverse can be taken to process just half a limb at a
time if the divisor is only a half limb. In this case the 1x1 multiply
for the inverse effectively becomes two *(1/2)x1* for each
limb, which can be a saving on CPUs with a fast half limb multiply, or in fact
if the only multiply is a half limb, and especially if it’s not pipelined.

Next: Basecase Division, Previous: Division Algorithms, Up: Division Algorithms [Index]