More div_qr_2 assembler
Niels Möller
nisse at lysator.liu.se
Thu Mar 31 21:55:32 CEST 2011
Torbjorn Granlund <tg at gmplib.org> writes:
> I compare divrem_2 and mpn_div_qr_2n. Why is the latter slower for small
> operands?
I see a few obvious differences:
1. There's an extra function call (speed mpn_div_qr_2n calls
mpn_div_qr_2 which reads d1, d0, checks the high bit, computes the
inverse, and then calls mpn_div_qr_2n_pi1).
2. Both call invert_limb to compute a 2/1 inverse, but the adjustment
to a 3/2 inverse is done in assembler in divrem_2, while div_qr_2
uses the C macro invert_pi1.
3. The argument list is longer for mpn_div_qr_2n_pi1 than for divrem_2:

     (mp_ptr qp, mp_ptr rp,
      mp_srcptr np, mp_size_t nn,
      mp_limb_t d1, mp_limb_t d0, mp_limb_t di)

   vs

     (mp_ptr qp, mp_size_t qxn,
      mp_ptr np, mp_size_t nn,
      mp_srcptr dp)

The di argument is passed on the stack for x86_64. And passing rp
means that one additional register must be saved and restored.
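
To illustrate the adjustment mentioned in point 2, here is a minimal sketch in portable C. It assumes 32-bit limbs (B = 2^32) so that double-limb products fit in uint64_t, and it is modeled on the general shape of a 2/1-to-3/2 inverse adjustment rather than copied from the GMP sources. The names inverse_2by1_32 and inverse_3by2_32 are hypothetical, and d1 is assumed to be normalized (high bit set).

```c
#include <stdint.h>

/* Assumption: 32-bit limbs, B = 2^32. Hypothetical helper names. */

/* 2/1 inverse of a normalized limb d1: floor((B^2 - 1) / d1) - B. */
static uint32_t
inverse_2by1_32 (uint32_t d1)
{
  /* The quotient lies in [B, 2B), so truncating to 32 bits subtracts B. */
  return (uint32_t) (UINT64_MAX / d1);
}

/* 3/2 inverse of the two-limb divisor d1*B + d0:
   floor((B^3 - 1) / (d1*B + d0)) - B. */
uint32_t
inverse_3by2_32 (uint32_t d1, uint32_t d0)
{
  uint32_t v = inverse_2by1_32 (d1);
  uint32_t p = (uint32_t) (d1 * v);     /* low limb of d1 * v */

  /* First adjustment: account for adding d0 to the low limb. */
  p += d0;
  if (p < d0)                           /* carry out of the low limb */
    {
      v--;
      uint32_t mask = -(uint32_t) (p >= d1);  /* all ones or zero */
      p -= d1;
      v += mask;                        /* decrements v once more if mask is set */
      p -= mask & d1;
    }

  /* Second adjustment: account for the high limb of d0 * v. */
  uint64_t t = (uint64_t) d0 * v;
  uint32_t t1 = (uint32_t) (t >> 32);
  uint32_t t0 = (uint32_t) t;

  p += t1;
  if (p < t1)                           /* carry out again */
    {
      v--;
      if (p >= d1 && (p > d1 || t0 >= d0))
        v--;
    }
  return v;
}
```

With these definitions, inverse_3by2_32 (0x80000000, 0) gives 0xFFFFFFFF and inverse_3by2_32 (0xFFFFFFFF, 0xFFFFFFFF) gives 0. The point of the sketch is only that the adjustment is a handful of multiplies, compares and conditional decrements, which is the part divrem_2 does in assembler.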
I don't know which is the worst culprit, nor whether these three are
enough to explain the 15 cycle difference. The handling of qh is
a bit different as well.
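
On point 3, the two argument lists can be annotated with where each argument lands under the System V AMD64 calling convention, which passes the first six integer or pointer arguments in registers and the rest on the stack. The sketch below uses stand-in typedefs so it compiles without gmp.h. The function names, the return type and the stub bodies are placeholders for illustration only, not the real routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in typedefs (assumption: 64-bit limbs, as on x86_64). */
typedef uint64_t mp_limb_t;
typedef long mp_size_t;
typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;

/* Seven arguments: six in registers, the seventh on the stack. */
mp_limb_t
sketch_div_qr_2n_pi1 (mp_ptr qp,        /* rdi */
                      mp_ptr rp,        /* rsi */
                      mp_srcptr np,     /* rdx */
                      mp_size_t nn,     /* rcx */
                      mp_limb_t d1,     /* r8 */
                      mp_limb_t d0,     /* r9 */
                      mp_limb_t di)     /* stack */
{
  (void) qp; (void) rp; (void) np; (void) nn; (void) d1; (void) d0;
  return di;                            /* stub: reads the stack argument */
}

/* Five arguments: all in registers; d1 and d0 are loaded from dp. */
mp_limb_t
sketch_divrem_2 (mp_ptr qp,             /* rdi */
                 mp_size_t qxn,         /* rsi */
                 mp_ptr np,             /* rdx */
                 mp_size_t nn,          /* rcx */
                 mp_srcptr dp)          /* r8 */
{
  (void) qp; (void) qxn; (void) np; (void) nn;
  return dp[1];                         /* stub: reads the high divisor limb */
}
```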
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.