More div_qr_2 assembler
Niels Möller
nisse at lysator.liu.se
Thu Mar 31 12:45:02 CEST 2011
Torbjorn Granlund <tg at gmplib.org> writes:
> How can you time all three using the same speed command, given that the
> 2u variant doesn't accept the same divisors as the other two?
SPEED_ROUTINE_MPN_DIV_QR_2 modifies the divisor to make sure it's
normalized (for measuring 2n) or unnormalized (for 2u). So only
div_qr_2n is directly comparable to divrem_2.
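
To make the normalized/unnormalized distinction concrete, here is a minimal
sketch of normalizing a two-limb divisor, i.e. shifting it left until the most
significant bit of the high limb is set. This is not the code in tune/speed or
in GMP; the name normalize_2 is hypothetical, and 64-bit limbs stored least
significant first are an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP. Assumes 64-bit limbs with
   d[0] the low limb and d[1] the high limb. Shifts the divisor left
   until the top bit of d[1] is set and returns the shift count. */
static unsigned normalize_2(uint64_t d[2])
{
    const uint64_t high_bit = (uint64_t) 1 << 63;
    unsigned shift = 0;

    assert(d[1] != 0);  /* a genuine two-limb divisor has a nonzero high limb */

    while ((d[1] & high_bit) == 0) {
        /* Move the top bit of the low limb into the high limb. */
        d[1] = (d[1] << 1) | (d[0] >> 63);
        d[0] <<= 1;
        shift++;
    }
    return shift;
}
```

A return value of 0 means the divisor was already normalized (the 2n case);
any other value corresponds to the unnormalized (2u) case.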
> To fairly compare divrem_2--now used for all divisors but with
> preshifting of the operands--to the new shift-on-the-fly code, I think
> we should include the time for shifting. Perhaps add a new target to
> tune/speed doing exactly that?
A divrem_2u target? Could add that. Or one could compare with the cycles
for mpn_lshift, which seem to be 1.25 or so on the same machine,
indicating that div_qr_2u should win over shift + divrem_2 by a small
margin.
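
For reference, the preshifting step whose cost is being discussed amounts to
one pass over the operand like the sketch below. It is written in plain C with
mpn_lshift-style semantics (the bits shifted out at the top are returned); the
name lshift_limbs and the fixed 64-bit limb type are assumptions for
illustration, and the real mpn_lshift is typically hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable stand-in for an mpn_lshift-style routine.
   Shifts the n-limb number {sp, n} left by cnt bits (1 <= cnt <= 63),
   stores the low n limbs at rp, and returns the bits shifted out.
   Limbs are stored least significant first; rp may equal sp. */
static uint64_t lshift_limbs(uint64_t *rp, const uint64_t *sp,
                             size_t n, unsigned cnt)
{
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t limb = sp[i];           /* read before writing, so in-place works */
        rp[i] = (limb << cnt) | carry;   /* bring in bits from the limb below */
        carry = limb >> (64 - cnt);      /* bits that move up to the next limb */
    }
    return carry;
}
```

The shift + divrem_2 approach pays for this extra pass over the limbs before
the division starts, which is the cost a shift-on-the-fly routine avoids.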
Is it easy to benchmark the top-level mpn div functions using speed?
> The slowdown for normalised operands is more worrying.
Are you talking about the per-limb cost or the absolute numbers? For the
latter, the comparison is a bit unfair. The normalized case generates
n-2 quotient limbs (and a fairly cheap single bit qh), while the
unnormalized case generates n-1 quotient limbs using 3/2 division. And
that seems like a fundamental difference.
> SHLD_SLOW means "SHLD_SUPERSLOW"; some machines, e.g., AMD K8-K10, have
> slow but not superslow SHLD. (Of the 64-bit processors, Intel's Atom
> and VIA's Nano have superslow SHLD.)
I haven't looked into eliminating shld yet. I'm afraid it might increase
register pressure.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.