udiv_qr_3by2 vs divappr

Wed Sep 19 16:18:29 UTC 2018

This is a sketch of an idea which I have had for a long time, which we
actually might have discussed here before.  The idea is to cut a large
O(nn-dn) time off of schoolbook division.

Normally, the time taken by schoolbook division is latency(mpn_submul_1)
+ 2*latency(mul) + C.  Here C is the number of plain operations of
quotient limb approximation and mul means a scalar multiply used by the
same.

Couldn't we make the quotient limb approximation in parallel with the
O(n^2) work?  I think we can, but it requires that we run the submul_1
partially backwards, and thereby produce the most significant partial
remainder limbs early.

But that's impossible, right?  The quotient limb approximation is, eh,
approximate after all!  We cannot know the most significant partial
remainder limbs as early as required!  No, we cannot know them for sure,
but we can almost always tell.  And when we're wrong, we need to add
back and recompute.  Below is a very rough implementation.  (It surely is
not correct wrt indexes.)

The hope is that this could run almost as as fast as mul_basecase.
There, and in this proposed algorithm, subsequent addmul_1/submul_1 can
actually overlap with the help of OoO execution.  So hopefully on good
OoO pipelines this could not just hide the division limb approximation
latency, but actually hide much of the feed-in and wind-down latency of
submul_1!

qnext = divappr (np[1], np[0], d1, d0);
for (...)
  {
    // Apply q to most significant partial remainder limbs
    cy = submul_1 (np - 2, dp + dn - 2, qnext, 2);

    q = qnext;
    // Now, with the most significant few limbs of the next partial remainder
    // computed to some accuracy, compute q for the next iteration.  This might
    // overestimate q worse than with the old algorithms.
    qnext = divappr (np[1], np[0], d1, d0);

    // Apply q to all but the most significant partial remainder limbs.
    cy = submul_1 (np - dn, dp, q, dn - 2);

    if (UNLIKELY (cy > np[0]))
      {
        np[0] -= cy;
	// q was too large, qnext is garbage so recompute it!
	// (We could perhaps recompute it more cleverly.)
	ASSERT_CARRY(mpn_add_n (np - dn, np - dn, dp, dn));
	qnext = divappr (np[1], np[0], d1, d0);
      }
    else
      np[0] -= cy;
    np--;
  }

-- 
Torbjörn
Please encrypt, key id 0xC8601622