More div_qr_2 assembler
tg at gmplib.org
Thu Mar 31 17:19:46 CEST 2011
nisse at lysator.liu.se (Niels Möller) writes:
> To fairly compare divrem_2--now used for all divisors but with
> preshifting of the operands--to the new shift-on-the-fly code, I think
> we should include the time for shifting. Perhaps add a new target to
> tune/speed doing exactly that?
A divrem_2u target? Could add that. Or one could compare with the cycles
for mpn_lshift, which seem to be 1.25 or so on the same machine,
indicating that div_qr_2u should win over shift + divrem_2 by a small
Yes, kind of a divrem_2u target, as a sanity check that things get
faster (or "non-slower").
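For readers following along, here is a toy sketch of what "preshifting of the operands" amounts to. This is not GMP code: it packs a two-limb divisor into a uint32_t (as if limbs were 16 bits) and uses native division in place of divrem_2. The names clz32 and div_preshift are made up for illustration. The point is only that shifting numerator and divisor left by the same count leaves the quotient unchanged and shifts the remainder, so the remainder has to be shifted back afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not part of GMP. */

/* Count leading zero bits of a nonzero 32-bit value. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    assert(x != 0);
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Divide n by d by first shifting both operands left until the top bit
   of d is set (normalization), dividing, then shifting the remainder
   back. Native 64-bit division stands in for the real division loop. */
void div_preshift(uint64_t n, uint32_t d, uint64_t *q, uint32_t *r)
{
    unsigned cnt = clz32(d);

    /* The shifted numerator must still fit in 64 bits. */
    assert(cnt == 0 || (n >> (64 - cnt)) == 0);

    uint64_t ns = n << cnt;   /* preshifted numerator */
    uint32_t ds = d << cnt;   /* normalized divisor, top bit set */

    *q = ns / ds;                        /* same quotient as n / d */
    *r = (uint32_t)((ns % ds) >> cnt);   /* undo the shift on the remainder */
}
```

A shift-on-the-fly routine would do the same work without the separate passes over the operands, which is why the shifting time belongs in the comparison.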
Is it easy to benchmark the top-level mpn div functions using speed?
Sorry, I don't get this question...
> The slowdown for normalised operands is more worrying.
Are you talking about the per-limb cost or the absolute numbers? For the
latter, the comparison is a bit unfair. The normalized case generates
n-2 quotient limbs (and a fairly cheap single bit qh), while the
unnormalized case generates n-1 quotient limbs using 3/2 division. And
that seems like a fundamental difference.
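The limb counts above can be checked with a toy calculation. The sketch below assumes 16-bit limbs and a 3-limb numerator, so n = 3, and uses native division. The helper bits_needed is a made-up name, not a GMP function. With a normalized two-limb divisor the quotient fits in one limb plus a single high bit (n-2 limbs and qh). With a divisor whose high limb is as small as 1, the quotient can fill two full limbs (n-1).

```c
#include <stdint.h>

/* Toy model only: hypothetical helper, not part of GMP. */

/* Number of significant bits in x (0 for x == 0). */
unsigned bits_needed(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Largest 3-limb numerator when limbs are 16 bits wide. */
#define TOY_NUMER_MAX 0xFFFFFFFFFFFFULL

/* Smallest normalized 2-limb divisor: top bit of the high limb set. */
#define TOY_DIV_NORMALIZED 0x80000000u

/* Smallest 2-limb divisor with a nonzero high limb (unnormalized). */
#define TOY_DIV_UNNORMALIZED 0x00010000u
```

Here bits_needed(TOY_NUMER_MAX / TOY_DIV_NORMALIZED) is 17, that is one 16-bit limb plus one bit, while bits_needed(TOY_NUMER_MAX / TOY_DIV_UNNORMALIZED) is 32, that is two full limbs.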
I compare divrem_2 and mpn_div_qr_2n. Why is the latter slower for small
operands? Don't they compute the same thing, with the difference that
the latter wants an inverse passed in, while the former computes it
> SHLD_SLOW means "SHLD_SUPERSLOW"; some machines, e.g., AMD K8-K10, have
> slow but not superslow SHLD. (Of the 64-bit processors, Intel's Atom
> and VIA's Nano have superslow SHLD.)
I haven't looked into eliminating shld yet. I'm afraid it might increase
It will cost an extra register, for sure. Maybe two.