# udiv_qr_3by2 vs divappr

Niels Möller nisse at lysator.liu.se
Thu Aug 2 21:22:22 UTC 2018

```nisse at lysator.liu.se (Niels Möller) writes:

> I have also had another look at the analysis, and I'm fairly confident
> the algorithm is correct also for the corner cases of d_0 == 0 and {n1,
> n0} very close to {d1, d0}.

I was trying to write up the analysis, and then I found another nice way
to simplify the algorithm.

We're computing the quotient floor (B {u1, u0} / {d1, d0}), and we need
a special case at the very least for u1 == d1 and u0 == d0, to set q =
B-1 in that case.

In the algorithm, I needed to also handle {u1,u0} = {d1,d0} - 1 and - 2
specially, to rule out that the otherwise allowed "one-too-large" error
would attempt to return q = B. But it's just as easy to single out *all*
numbers for which q = B-1 is the correct quotient. We have

floor (B {u1, u0} / {d1, d0}) >= B - 1

if and only if

{u1, u0} >= {d1, d0} - d1

So replace the {d1, d0} - 2 check in the code by

/* d - d1 = {d1md1, d0 - d1} */
d1md1 = d1 - (d0 < d1);

for (...)
{
if (UNLIKELY (n1 >= d1md1) && (n1 > d1md1 || n0 >= (d0-d1)))

This UNLIKELY condition then gets slightly less unlikely, but I don't
think that's a problem, probability should still be on the order of 1/B
for random inputs.

The nice thing is, that when we compute the very first quotient
approximation as

{q1,q0} = v u1 + {u1,u0}

we have q1 <= correct quotient <= B-2.

So q1 + 1 can't overflow, and that simplifies its effects with d0. So
if the caller takes care of the q = B-1 cases as above, then divappr can
be reduced to

static inline mp_limb_t
divappr_2 (mp_limb_t n2, mp_limb_t n1,
mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
{
mp_limb_t q1, q0, r1, t1, t0, mask;

umul_ppmm (q1, q0, n2, dinv);
add_ssaaaa (q1, q0, q1, q0, n2, n1);

q1++;  /* Can't overflow */
umul_ppmm (t1, t0, q1, d0);  /* t0 isn't used */
r1 = n1 - d1 * q1 - t1;

mask = - (mp_limb_t) (r1 >= q0);
r1 += mask & (d1 + 1);
q1 += (r1 >= d1 - 1);

return q1;
}

So umul + umullo + umulhi, only a single double-precision add, and
fairly simple adjustment steps. To compare with

#define udiv_qr_3by2(q, r1, r0, n2, n1, n0, d1, d0, dinv)		\
do {									\
mp_limb_t _q0, _t1, _t0, _mask;					\
umul_ppmm ((q), _q0, (n2), (dinv));					\
add_ssaaaa ((q), _q0, (q), _q0, (n2), (n1));			\
\
/* Compute the two most significant limbs of n - q'd */		\
(r1) = (n1) - (d1) * (q);						\
sub_ddmmss ((r1), (r0), (r1), (n0), (d1), (d0));			\
umul_ppmm (_t1, _t0, (d0), (q));					\
sub_ddmmss ((r1), (r0), (r1), (r0), _t1, _t0);			\
(q)++;								\
\
/* Conditionally adjust q and the remainders */			\
_mask = - (mp_limb_t) ((r1) >= _q0);				\
if (UNLIKELY ((r1) >= (d1)))					\
{									\
if ((r1) > (d1) || (r0) >= (d0))				\
{								\
(q)++;							\
sub_ddmmss ((r1), (r0), (r1), (r0), (d1), (d0));		\
}								\
}									\
} while (0)

To be fair, it does some extra work to get the remainder. A fair
comparison should have on one hand, udiv_qr_3by2 + the logic in
mpn_sbpi1_div_qr to progate the the mpn_submul_1 into those two
remainder words, and on the other hand, divappr2 + two extra iterations
in mpn_submul_1.

And the trick to take care of all inputs producing q = B-1 up front
could maybe be used to eliminate the first sub_ddmmss above, but
analysis is slightly different due to the third numerator limb.

I'll leave a run of

while GMP_CHECK_RANDOMIZE=0 ./tests/mpn/t-div ; do : ; done

overnight. So far it seems to work fine. No benchmarking yet,

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
```