More div_qr_2 assembler

Thu Mar 31 12:45:02 CEST 2011

Torbjorn Granlund <tg at gmplib.org> writes:

> How can you time all three using the same speed command, given that the
> 2u variant don't accept the same divisors as the other two?

SPEED_ROUTINE_MPN_DIV_QR_2 modifies the divisor to make sure it's
normalized (for measuring 2n) or unnormalized (for 2u). So only
div_qr_2n is directly comparable to divrem_2.
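
For illustration, a minimal sketch of what forcing a divisor to be
normalized or unnormalized could look like. It assumes 64-bit limbs, and
the helper names are hypothetical; this is not the actual macro from
tune/speed.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_HIGHBIT ((uint64_t)1 << 63)

/* Hypothetical helper: set the top bit of the high divisor limb,
   so the divisor is normalized (the 2n case). */
static uint64_t make_normalized(uint64_t d1)
{
    return d1 | LIMB_HIGHBIT;
}

/* Hypothetical helper: clear the top bit of the high divisor limb,
   so the divisor is unnormalized (the 2u case). The limb is kept
   nonzero so the divisor still has two significant limbs. */
static uint64_t make_unnormalized(uint64_t d1)
{
    d1 &= ~LIMB_HIGHBIT;
    if (d1 == 0)
        d1 = 1;
    assert(d1 != 0 && !(d1 & LIMB_HIGHBIT));
    return d1;
}
```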

> To fairly compare divrem_2--now used for all divisors but with
> preshifting of the operands--to the new shift-on-the-fly code, I think
> we should include the time for shifting.  Perhaps add a new target to
> tune/speed doing exactly that?

A divrem_2u target? Could add that. Or one could compare with the cycles
for mpn_lshift, which seem to be 1.25 or so on the same machine,
indicating that div_qr_2u should win over shift + divrem_2 by a small
margin.
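
To make the "shift + divrem_2" cost concrete, here is a rough sketch of
the preshifting step alone, assuming 64-bit limbs stored least
significant first. Both function names are hypothetical; lshift_limbs
only stands in for the role mpn_lshift plays, and is not GMP's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits the divisor must be shifted left so that the top
   bit of its high limb d1 is set. d1 must be nonzero. */
static unsigned normalization_shift(uint64_t d1)
{
    unsigned cnt = 0;
    assert(d1 != 0);
    while (!(d1 >> 63)) {
        d1 <<= 1;
        cnt++;
    }
    return cnt;
}

/* Shift the n-limb number {up, n} left by cnt bits (0 <= cnt < 64),
   store the low n limbs at rp, and return the bits shifted out at
   the top. rp == up is allowed. */
static uint64_t lshift_limbs(uint64_t *rp, const uint64_t *up,
                             size_t n, unsigned cnt)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i];
        rp[i] = (u << cnt) | carry;
        /* Guard cnt == 0: a shift by 64 would be undefined in C. */
        carry = cnt ? u >> (64 - cnt) : 0;
    }
    return carry;
}
```

The same count has to be applied to both the divisor and the dividend
before the division, and the remainder shifted back afterwards; that
extra pass over the dividend is the cost being compared against
shifting on the fly.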

Is it easy to benchmark the top-level mpn div functions using speed?

> The slowdown for normalised operands is more worrying.

Are you talking about the per-limb cost or the absolute numbers? For the
latter, the comparison is a bit unfair. The normalized case generates
n-2 quotient limbs (and a fairly cheap single bit qh), while the
unnormalized case generates n-1 quotient limbs using 3/2 division. And
that seems like a fundamental difference.

> SHLD_SLOW means "SHLD_SUPERSLOW"; some machines, e.g., AMD K8-K10, have
> slow but not superslow SHLD.  (Of the 64-bit processors, Intel's atom
> and VIA's nano have superslow SHLD.)

I haven't looked into eliminating shld yet. I'm afraid it might increase
register pressure.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.