More div_qr_2 assembler

Thu Mar 31 17:19:46 CEST 2011

nisse at lysator.liu.se (Niels Möller) writes:

  > To fairly compare divrem_2--now used for all divisors but with
  > preshifting of the operands--to the new shift-on-the-fly code, I think
  > we should include the time for shifting.  Perhaps add a new target to
  > tune/speed doing exactly that?

  A divrem_2u target? Could add that. Or one could compare with the cycles
  for mpn_lshift. Which seems to be 1.25 or so on the same machine,
  indicating that div_qr_2u should win over shift + divrem_2 by a small
  margin.

Yes, kind of a divrem_2u target, as a sanity check that things get
faster (or "non-slower").
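
For concreteness, the preshifting step that a "shift + divrem_2" measurement would have to include can be sketched as below. This is only an illustration on plain 64-bit words, not GMP code: the helper name normalize_2 is made up, and __builtin_clzll assumes GCC or Clang. The numerator would additionally need to be shifted left by the same count, which is where the mpn_lshift cost comes in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP: shift the two-limb divisor
   d1:d0 left until the most significant bit of d1 is set, and return
   the shift count.  Assumes 64-bit limbs and a nonzero high limb. */
static unsigned normalize_2(uint64_t *d1, uint64_t *d0)
{
    assert(*d1 != 0);
    /* __builtin_clzll is a GCC/Clang builtin (assumption). */
    unsigned cnt = (unsigned) __builtin_clzll(*d1);
    if (cnt != 0) {
        /* Guarded so that we never shift by 64, which is undefined in C. */
        *d1 = (*d1 << cnt) | (*d0 >> (64 - cnt));
        *d0 <<= cnt;
    }
    return cnt;
}
```

The shift-on-the-fly variant would avoid this separate pass by folding the same shifts into the division loop.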

  Is it easy to benchmark the top-level mpn div functions using speed?

Sorry, I don't get this question...

  > The slowdown for normalised operands is more worrying.

  Are you talking about the per-limb cost or the absolute numbers? For the
  latter, the comparison is a bit unfair. The normalized case generates
  n-2 quotient limbs (and a fairly cheap single bit qh), while the
  unnormalized case generates n-1 quotient limbs using 3/2 division. And
  that seems like a fundamental difference.

I compare divrem_2 and mpn_div_qr_2n.  Why is the latter slower for small
operands?  Don't they compute the same thing, the only difference being
that the latter wants the inverse passed in, while the former computes it
locally?

  > SHLD_SLOW means "SHLD_SUPERSLOW"; some machines, e.g., AMD K8-K10, have
  > slow but not superslow SHLD.  (Of the 64-bit processors, Intel's Atom
  > and VIA's Nano have superslow SHLD.)

  I haven't looked into eliminating shld yet. I'm afraid it might increase
  register pressure.

It will cost an extra register, for sure.  Maybe two.
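
To illustrate where the extra register(s) go, here is a minimal C sketch of what a shld-free double-limb shift looks like. It is not taken from the actual assembly; the name shld_emul is hypothetical and 64-bit limbs are assumed. The temporary holding the bits shifted out of the low limb is one extra register, and keeping the complementary count 64-cnt live across the loop would presumably be the second.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the high limb of (hi:lo) << cnt, i.e. what a
   single shld instruction would leave in the destination register.
   Requires 0 < cnt < 64. */
static uint64_t shld_emul(uint64_t hi, uint64_t lo, unsigned cnt)
{
    assert(cnt > 0 && cnt < 64);
    uint64_t t = lo >> (64 - cnt);  /* shr: needs a temporary, and 64-cnt */
    return (hi << cnt) | t;         /* shl + or */
}
```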

-- 
Torbjörn