udiv_qr_3by2 vs divappr

Thu Aug 2 21:22:22 UTC 2018

nisse at lysator.liu.se (Niels Möller) writes:

> I have also had another look at the analysis, and I'm fairly confident
> the algorithm is correct also for the corner cases of d_0 == 0 and {n1,
> n0} very close to {d1, d0}.

I was trying to write up the analysis, and then I found another nice way
to simplify the algorithm.

We're computing the quotient floor (B {u1, u0} / {d1, d0}), and we need
a special case at the very least for u1 == d1 and u0 == d0, to set q =
B-1 in that case.

In the algorithm, I needed to also handle {u1,u0} = {d1,d0} - 1 and - 2
specially, to rule out that the otherwise allowed "one-too-large" error
would attempt to return q = B. But it's just as easy to single out *all*
numbers for which q = B-1 is the correct quotient. We have

  floor (B {u1, u0} / {d1, d0}) >= B - 1

if and only if

  {u1, u0} >= {d1, d0} - d1

So replace the {d1, d0} - 2 check in the code by 

  /* d - d1 = {d1md1, d0 - d1} */
  d1md1 = d1 - (d0 < d1);

  for (...)
    { 
      if (UNLIKELY (n1 >= d1md1) && (n1 > d1md1 || n0 >= (d0-d1)))
	q = GMP_NUMB_MASK;

This UNLIKELY condition then gets slightly less unlikely, but I don't
think that's a problem, probability should still be on the order of 1/B
for random inputs.

The nice thing is, that when we compute the very first quotient
approximation as

  {q1,q0} = v u1 + {u1,u0}

we have q1 <= correct quotient <= B-2.

So q1 + 1 can't overflow, and that simplifies its effects with d0. So
if the caller takes care of the q = B-1 cases as above, then divappr can
be reduced to

static inline mp_limb_t
divappr_2 (mp_limb_t n2, mp_limb_t n1,
	   mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
{
  mp_limb_t q1, q0, r1, t1, t0, mask;

  umul_ppmm (q1, q0, n2, dinv);
  add_ssaaaa (q1, q0, q1, q0, n2, n1);

  q1++;  /* Can't overflow */
  umul_ppmm (t1, t0, q1, d0);  /* t0 isn't used */
  r1 = n1 - d1 * q1 - t1;

  mask = - (mp_limb_t) (r1 >= q0);
  q1 += mask;
  r1 += mask & (d1 + 1);
  q1 += (r1 >= d1 - 1);

  return q1;
}

So umul + umullo + umulhi, only a single double-precision add, and
fairly simple adjustment steps. To compare with 

#define udiv_qr_3by2(q, r1, r0, n2, n1, n0, d1, d0, dinv)		\
  do {									\
    mp_limb_t _q0, _t1, _t0, _mask;					\
    umul_ppmm ((q), _q0, (n2), (dinv));					\
    add_ssaaaa ((q), _q0, (q), _q0, (n2), (n1));			\
									\
    /* Compute the two most significant limbs of n - q'd */		\
    (r1) = (n1) - (d1) * (q);						\
    sub_ddmmss ((r1), (r0), (r1), (n0), (d1), (d0));			\
    umul_ppmm (_t1, _t0, (d0), (q));					\
    sub_ddmmss ((r1), (r0), (r1), (r0), _t1, _t0);			\
    (q)++;								\
									\
    /* Conditionally adjust q and the remainders */			\
    _mask = - (mp_limb_t) ((r1) >= _q0);				\
    (q) += _mask;							\
    add_ssaaaa ((r1), (r0), (r1), (r0), _mask & (d1), _mask & (d0));	\
    if (UNLIKELY ((r1) >= (d1)))					\
      {									\
	if ((r1) > (d1) || (r0) >= (d0))				\
	  {								\
	    (q)++;							\
	    sub_ddmmss ((r1), (r0), (r1), (r0), (d1), (d0));		\
	  }								\
      }									\
  } while (0)

To be fair, it does some extra work to get the remainder. A fair
comparison should have on one hand, udiv_qr_3by2 + the logic in
mpn_sbpi1_div_qr to progate the the mpn_submul_1 into those two
remainder words, and on the other hand, divappr2 + two extra iterations
in mpn_submul_1.

And the trick to take care of all inputs producing q = B-1 up front
could maybe be used to eliminate the first sub_ddmmss above, but
analysis is slightly different due to the third numerator limb.

I'll leave a run of

  while GMP_CHECK_RANDOMIZE=0 ./tests/mpn/t-div ; do : ; done

overnight. So far it seems to work fine. No benchmarking yet,

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.