udiv_qr_3by2 vs divappr
Niels Möller
nisse at lysator.liu.se
Thu Aug 2 21:22:22 UTC 2018
nisse at lysator.liu.se (Niels Möller) writes:
> I have also had another look at the analysis, and I'm fairly confident
> the algorithm is correct also for the corner cases of d_0 == 0 and {n1,
> n0} very close to {d1, d0}.
I was trying to write up the analysis, and then I found another nice way
to simplify the algorithm.
We're computing the quotient floor (B {u1, u0} / {d1, d0}), and we need
a special case at the very least for u1 == d1 and u0 == d0, to set q =
B-1 in that case.
In the algorithm, I needed to also handle {u1,u0} = {d1,d0} - 1 and - 2
specially, to rule out that the otherwise allowed "one-too-large" error
would attempt to return q = B. But it's just as easy to single out *all*
numbers for which q = B-1 is the correct quotient. We have
floor (B {u1, u0} / {d1, d0}) >= B - 1
if and only if
{u1, u0} >= {d1, d0} - d1
So replace the {d1, d0} - 2 check in the code by
/* d - d1 = {d1md1, d0 - d1} */
d1md1 = d1 - (d0 < d1);
for (...)
{
if (UNLIKELY (n1 >= d1md1) && (n1 > d1md1 || n0 >= (d0-d1)))
q = GMP_NUMB_MASK;
This UNLIKELY condition then gets slightly less unlikely, but I don't
think that's a problem, probability should still be on the order of 1/B
for random inputs.
The nice thing is, that when we compute the very first quotient
approximation as
{q1,q0} = v u1 + {u1,u0}
we have q1 <= correct quotient <= B-2.
So q1 + 1 can't overflow, and that simplifies its effects with d0. So
if the caller takes care of the q = B-1 cases as above, then divappr can
be reduced to
static inline mp_limb_t
divappr_2 (mp_limb_t n2, mp_limb_t n1,
mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
{
mp_limb_t q1, q0, r1, t1, t0, mask;
umul_ppmm (q1, q0, n2, dinv);
add_ssaaaa (q1, q0, q1, q0, n2, n1);
q1++; /* Can't overflow */
umul_ppmm (t1, t0, q1, d0); /* t0 isn't used */
r1 = n1 - d1 * q1 - t1;
mask = - (mp_limb_t) (r1 >= q0);
q1 += mask;
r1 += mask & (d1 + 1);
q1 += (r1 >= d1 - 1);
return q1;
}
So umul + umullo + umulhi, only a single double-precision add, and
fairly simple adjustment steps. To compare with
#define udiv_qr_3by2(q, r1, r0, n2, n1, n0, d1, d0, dinv) \
do { \
mp_limb_t _q0, _t1, _t0, _mask; \
umul_ppmm ((q), _q0, (n2), (dinv)); \
add_ssaaaa ((q), _q0, (q), _q0, (n2), (n1)); \
\
/* Compute the two most significant limbs of n - q'd */ \
(r1) = (n1) - (d1) * (q); \
sub_ddmmss ((r1), (r0), (r1), (n0), (d1), (d0)); \
umul_ppmm (_t1, _t0, (d0), (q)); \
sub_ddmmss ((r1), (r0), (r1), (r0), _t1, _t0); \
(q)++; \
\
/* Conditionally adjust q and the remainders */ \
_mask = - (mp_limb_t) ((r1) >= _q0); \
(q) += _mask; \
add_ssaaaa ((r1), (r0), (r1), (r0), _mask & (d1), _mask & (d0)); \
if (UNLIKELY ((r1) >= (d1))) \
{ \
if ((r1) > (d1) || (r0) >= (d0)) \
{ \
(q)++; \
sub_ddmmss ((r1), (r0), (r1), (r0), (d1), (d0)); \
} \
} \
} while (0)
To be fair, it does some extra work to get the remainder. A fair
comparison should have on one hand, udiv_qr_3by2 + the logic in
mpn_sbpi1_div_qr to progate the the mpn_submul_1 into those two
remainder words, and on the other hand, divappr2 + two extra iterations
in mpn_submul_1.
And the trick to take care of all inputs producing q = B-1 up front
could maybe be used to eliminate the first sub_ddmmss above, but
analysis is slightly different due to the third numerator limb.
I'll leave a run of
while GMP_CHECK_RANDOMIZE=0 ./tests/mpn/t-div ; do : ; done
overnight. So far it seems to work fine. No benchmarking yet,
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list