# Division interfaces

Torbjorn Granlund tg at gmplib.org
Tue Mar 15 14:09:34 CET 2011

```nisse at lysator.liu.se (Niels Möller) writes:

nisse at lysator.liu.se (Niels Möller) writes:

> And then we have the issue of unnormalized divisors (it looks like
> mpn_sbpi1_div_qr still requires that dp[dn-1] has the high bit set.

I'm attaching a first crude version with on-the-fly normalization of the
high limbs.

If one reads the top 3 limbs of n and top 2 of d and shift left to
normalize d (shifting in zeros), and use 3/2 division, one will in
effect do a (2 limbs + k bits)/(1 limb + k bits) division, where 0 < k =
(GMP_LIMB_BITS - shift_count). Then the probability of the "unlikely"
update in schoolbook division is on the order of 2^{-k}, which for small
k (large shift count) actually isn't very unlikely.

For this reason, the code uses 4 limbs of n and 3 limbs of d (and a 4/3
division built from 3/2 division and an extra update step). This is
overkill when the shift count is small, so maybe one should have three
different variants,

1. Normalized d.

2. Shift count < (GMP_LIMB_BITS - 10). Use 3/2 division, the "unlikely"
update happens with probability < 2^{-10}.

3. Larger shift counts. Use 4/3 division (like the attached code), then
the unlikely update probability is < 2^{-GMP_LIMB_BITS}.

There is actually some need for:

4. Side-channel silent SB division.  The O(log(D)) corrections must be
performed always, with their effect nullified when not needed
(using mpn_subcnd_n).  Here, we should not worry about making a
quotient limb approximation correct with particularly high
probability.

I looked briefly at your code.  Wouldn't it make sense to stay with 3/2
division, using the 2 x GMP_NUMB_BITS most significant bits from the
divisor, and the 3 x GMP_NUMB_BITS bits from the partial remainder?
Thus, feed udiv_qr_3by2 with *exactly* the sequence of parameters it

The 3 x GMP_NUMB_BITS partial remainder bits would need to be formed by
reading the 4 most significant partial remainder limbs from memory.  But
since we during quotient approximation (i.e., in udiv_qr_3by2) compute a
part of the next partial remainder, we can shorten the submul_1 trip
count, and not only save submul_1 execution time, but also simplify
formation of the next 3 partial remainder limbs that are to be fed to
udiv_qr_3by2.

Am I missing some reason for udiv_qr_4by3?

(I'd say to skip the 2^{-10} variant for now, and concentrate on making
the generic version as fast as possible.)

--
Torbjörn
```