udiv_qr_3by2 vs divappr

Niels Möller nisse at lysator.liu.se
Wed Sep 5 07:34:33 UTC 2018


tg at gmplib.org (Torbjörn Granlund) writes:

> Why is (the obsolete function) mpn_tdiv_qr's interface relevant
> here?

Because it's the only top-level function, and because a fair amount of
the newer division code also doesn't clobber np (it seems dc might allow
clobbering n, and the mu code doesn't).

> You time just the lower-level sbpi1 functions, right?

Right.

> It might be clean to have divappr_2 as a separate function, but perhaps
> expanding its code in the mpn_sbpi1_div_qr loop would expose the
> possibility for decreasing the submul_1 size argument.  If the measured
> speedup is less than you expected, perhaps the old code's dn-2 size
> argument explains some of it?

Could make sense, but it's going to be a bit tricky. Let q' be the initial
candidate quotient, and q be the return value of divappr_2. Then the new
code needs

  mullo (q', d1)
  umulhi (q', d0)

as part of divappr_2, and then it recomputes the full products q*d1 and
q*d0 as part of the submul_1 call. It would be nice to at least reduce
the submul_1 count by one, to eliminate the full product q*d1. But to do
that, we'd need divappr_2 to either return some kind of useful remainder,
or compute the full product q' * d1, and adjust and return it.
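
(To make the recomputation concrete, here's a small standalone
illustration, not GMP code: 64-bit limbs, product halves taken via
unsigned __int128, and qp, d1, d0 just placeholder names.

  #include <stdint.h>
  #include <assert.h>

  typedef uint64_t limb;

  static limb mullo (limb a, limb b) { return a * b; }
  static limb umulhi (limb a, limb b)
  { return (limb) (((unsigned __int128) a * b) >> 64); }

  int
  main (void)
  {
    limb qp = 0x123456789abcdef0, d1 = 0xfedcba9876543210, d0 = 42;

    /* The halves divappr_2 needs for the candidate q'.  */
    limb lo1 = mullo (qp, d1);
    limb hi0 = umulhi (qp, d0);

    /* The later submul_1 call forms the full products (take q == q'
       here for simplicity), so those halves get computed twice.  */
    unsigned __int128 p1 = (unsigned __int128) qp * d1;
    unsigned __int128 p0 = (unsigned __int128) qp * d0;
    assert ((limb) p1 == lo1 && (limb) (p0 >> 64) == hi0);
    return 0;
  }

That duplication is what the alternatives below try to reduce.)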

One can note that the net adjustment, q - q', is 0, 1 or -1, with +1
being unlikely, I think. To shorten the submul_1 count by one, what's
needed is a remainder

   {u_1, u_0} - q d_1

This doesn't quite fit in a single word, though, and I don't see any easy way
to derive the high word without computing a full product, either q * d_1
(no gain over submul_1) or q' * d_1 followed by some hopefully cheap
adjustments. And we need the correct high word to be able to check for
underflow after submul_1.
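
To spell out the q' route: with e = q - q' in {-1, 0, +1},

   {u_1, u_0} - q d_1 = ({u_1, u_0} - q' d_1) - e d_1

so the correction by e d_1 is the hopefully cheap part; the first term
still needs the high word of the full product q' d_1, which is the
problem.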

Alternatively, we could aim to reduce the submul_1 count by two; then we
need the remainder

  R = {u_1, u_0, 0} - q {d_1, d_0}

That seems a bit difficult to make efficient. First, unlike udiv_3by2,
this remainder might be negative, so we need three limbs (upper limb
always 0 or -1). If we adjust the low limb incrementally, we're
basically back to udiv_3by2. What we might do is record the net q
adjustment (based on the divappr_2 conditions touching r1 only), take an
unlikely branch for the case of a net +1, and try to get R as cheaply as
possible when the adjustment is 0 or -1.
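
(For reference, forming R naively from q looks like this in standalone
C, not GMP code, with 64-bit limbs and B = 2^64:

  #include <stdint.h>

  typedef uint64_t limb;

  /* R = {u1, u0, 0} - q * {d1, d0}, computed mod B^3 as three limbs
     {r[2], r[1], r[0]}; in the divappr setting r[2] should come out
     as 0 when R >= 0 and ~0 (all ones) when R < 0.  */
  void
  remainder_3 (limb r[3], limb u1, limb u0, limb q, limb d1, limb d0)
  {
    unsigned __int128 qd1 = (unsigned __int128) q * d1;
    unsigned __int128 qd0 = (unsigned __int128) q * d0;

    /* Low two limbs: {u0, 0} minus qd0 and minus the low limb of
       qd1 shifted up one limb, collecting the borrows.  */
    unsigned __int128 low = (unsigned __int128) u0 << 64;
    limb b0 = low < qd0;
    low -= qd0;
    unsigned __int128 t = (unsigned __int128) (limb) qd1 << 64;
    limb b1 = low < t;
    low -= t;

    r[0] = (limb) low;
    r[1] = (limb) (low >> 64);
    /* Top limb: u1 minus the high limb of qd1 and the borrows.  */
    r[2] = u1 - (limb) (qd1 >> 64) - b0 - b1;
  }

Done like this it costs two full products plus the borrow chains,
roughly what submul_1 would spend on those two limbs anyway; the hope
above is to get most of R from what divappr_2 already computed for q'.)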

A third alternative might be to aim for returning

   {u_1, u_0} - q d_1 - mulhi (q, d_0)

which is closer to the r1 value as used in divappr_2. That would fit
reasonably well with the structure. I see two problems that need cheap
solutions: handling the difference between mulhi(q, d_0) and
mulhi(q', d_0), and getting the underflow right in the unlikely case
that this quantity is negative.

> I believe we could find CPUs (mainly low-end and obsolete high-end ones)
> where the old code will beat the new code because of the old code's
> lower submul_1 size argument.

Any particular machines do you have in mind? It would be nice to have
some numbers.

> Micro-optimisation: Replace q1++ with n2+1 in add_ssaaaa.

Nice! Should save one cycle.
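
(The identity behind that fold, in standalone C rather than the actual
divappr_2 code; the variable names are placeholders and this
add_ssaaaa is a portable stand-in for GMP's macro:

  #include <stdint.h>
  #include <assert.h>

  typedef uint64_t limb;

  /* Double-limb add {sh, sl} = {ah, al} + {bh, bl}.  */
  #define add_ssaaaa(sh, sl, ah, al, bh, bl)              \
    do {                                                  \
      limb __al = (al);                                   \
      (sl) = __al + (bl);                                 \
      (sh) = (ah) + (bh) + ((sl) < __al);                 \
    } while (0)

  int
  main (void)
  {
    limb n2 = 123, n1 = ~(limb) 0, x1 = 17, x0 = ~(limb) 0;

    limb a1, a0;                  /* add, then a separate increment */
    add_ssaaaa (a1, a0, x1, x0, n2, n1);
    a1++;

    limb q1, q0;                  /* +1 folded into the high addend */
    add_ssaaaa (q1, q0, x1, x0, n2 + 1, n1);

    /* Same two-limb result: the high limb is reduced mod 2^64 either
       way, so folding the +1 into the addend is exact.  */
    assert (q1 == a1 && q0 == a0);
    return 0;
  }

The dependent increment disappears, hence the saved cycle.)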

> Could the most significant limb of the partial remainder be kept in a
> scalar between iterations?

I think that makes sense if and only if we find a good way to reduce the
submul_1 count by one.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

