speed of unbalanced division

Mon Feb 4 21:53:58 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > I expect it to quite simple to greatly improve the situation with some
  > analysis. To get the curves perfectly smooth might require more work.

  I think a reasonable strategy would be to check if qn < dn, and in that
  case drop the low dn - qn - 1 limbs from both u and d when computing the
  quotient. And then do a final adjustment including all limbs. I think
  this should be applied to all sizes, except for small enough sizes that
  all subroutines are quadratic.

The qn < dn case is where we perform very nicely.  We already do what
you suggest, except that we (typically) never look at the low limbs.

(If we are to compute the remainder, we will apply a different strategy,
since then we cannot ignore low limbs.)

The qn > dn case is where the current code performs badly.

There, we now choose to compute an inverse of size in = ceil(dn/2).
Since qn > dn, we will have 2in < qn, which means that we will need to
develop quotient limbs in more than 2 blocks; the code computes a most
significant block (qn - 2in) quotient limbs, and then two blocks of (in)
quotient limbs each.  (The final block uses mpn_preinv_mu_divappr_q.)

The inordinate cost we see emanates from the computation of the qn - 2in
block.  Things get back to normal when qn >> dn, where we compute an
initial block is (qn - in) quotient limbs, and the *one* block of (in)
quotient limbs.

For qn = 100000 we have a slowdown of about 55% (compare nn=200000 with
nn=174500).

The fix is choosing the inverse size less stupidly, and organising the
computation in more similarly sized quotient blocks.  It seems this will
require more than just computing (in) differently, it will require some
hacking.

PS. I wrote the code in question.  I would never call somebody else's
stupid code for stupid.  :-)

-- 
Torbjörn