udiv_qr_3by2 vs divappr

Thu Sep 20 10:50:39 UTC 2018

paul zimmermann <Paul.Zimmermann at inria.fr> writes:

  1) those with only O(1) preprocessing time, like the precomputation of an
     inverse of the 1 or 2 uppers limbs of the divisor

  2) those with O(n) preprocessing time, like Svoboda division

And (2) is perhaps OK earlier since it saves O(n) time in the main
quotient limb development loop.

I suppose a "naive" implementation scales the full dividend and the full
divisor by a factor which fits in a limb.  One could in theory save some
time by scaling just their O(log(Q)) most significant limb.

  > Are you thinking of Svoboda division here?  That's another possibility
  > for speeding up schoolbook division.

  yes, in Svoboda division the quotient selection is for free. But you still
  have to wait the end of the previous submul_1 (or submul_2) call to access
  the next quotient, whereas in mpn_mul_basecase all multipliers are known in
  advance. Does this makes a difference?

How likely does Svoboda overesimate a quotient limb?

The trick I'm looking into now does not fully serialise consecutive
submul_1 calls except in the unlikely case that a quotient limb is
overestimated.  It will be overestimated more often than in our present
code, but I haven't analysed how much more often.  It needs to be quite
rare for the algorithm to be good.  To keep it low, one could move
slightly more work for the initial submul_1, not just 2 as in my example
code.

Note that a real implementation should be able to use the intermediate
results of udiv_qr_3by2 (or divappr) to make the initial submul_1 very
simple; I assume just one or two extra mul + some sub instructions will
be needed.

-- 
Torbjörn
Please encrypt, key id 0xC8601622