More div_qr_2 assembler

Thu Mar 31 21:55:32 CEST 2011

Torbjorn Granlund <tg at gmplib.org> writes:

> I compare divrem_2 and mpn_div_qr_2n.  Why is the latter slower for small
> operands?

I see a few obvious differences:

1. There's an extra function call (the speed benchmark mpn_div_qr_2n
   calls mpn_div_qr_2, which reads d1 and d0, checks the high bit,
   computes the inverse, and then calls mpn_div_qr_2n_pi1).

2. Both call invert_limb to compute a 2/1 inverse, but the adjustment
   to a 3/2 inverse is done in assembler in divrem_2, while div_qr_2
   uses the C macro invert_pi1.

3. The argument list is longer for mpn_div_qr_2n_pi1 than for divrem_2:

             (mp_ptr qp, mp_ptr rp,
              mp_srcptr np, mp_size_t nn,
              mp_limb_t d1, mp_limb_t d0, mp_limb_t di)
   vs
             (mp_ptr qp, mp_size_t qxn,
              mp_ptr np, mp_size_t nn,
              mp_srcptr dp)

   The di argument, being the seventh, is passed on the stack on
   x86_64. And passing rp means that one additional register must be
   saved and restored.
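
To make item 2 concrete, here is a rough sketch in portable C of the
adjustment from a 2/1 inverse to a 3/2 inverse. It is an illustration
only, not the GMP source: it assumes a hypothetical 32-bit limb type
(limb_t) so that uint64_t can hold a double-limb product, and the names
invert_2by1 and invert_3by2 are made up for this example. GMP's
invert_limb and invert_pi1 are macros with different interfaces.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* hypothetical limb type, B = 2^32 */

/* 2/1 inverse: floor((B^2 - 1) / d1) - B, for d1 with the high bit set. */
static limb_t invert_2by1(limb_t d1)
{
    assert(d1 >> 31);           /* d1 must be normalized */
    /* The quotient lies in [B, 2B); truncating to one limb subtracts B. */
    return (limb_t)(UINT64_MAX / d1);
}

/* 3/2 inverse: floor((B^3 - 1) / (d1*B + d0)) - B, obtained by
   adjusting the 2/1 inverse of d1. All limb arithmetic wraps mod B. */
static limb_t invert_3by2(limb_t d1, limb_t d0)
{
    limb_t v = invert_2by1(d1);
    limb_t p = d1 * v;          /* low limb of d1 * v */

    p += d0;
    if (p < d0) {               /* carry out: v is too large */
        v--;
        if (p >= d1) {
            v--;
            p -= d1;
        }
        p -= d1;
    }

    /* Account for the contribution of d0 * v. */
    uint64_t t = (uint64_t)d0 * v;
    limb_t t1 = (limb_t)(t >> 32);
    limb_t t0 = (limb_t)t;

    p += t1;
    if (p < t1) {               /* carry out again */
        v--;
        /* Compare the two-limb values <p, t0> and <d1, d0>. */
        if (p > d1 || (p == d1 && t0 >= d0))
            v--;
    }
    return v;
}
```

The data-dependent branches are the part that hand-written assembler
can replace with conditional moves or mask arithmetic, which may be
some of what divrem_2 gains here.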

I don't know which is the worst culprit, nor whether these three are
enough to explain the 15 cycle difference. The handling of qh is a bit
different as well.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.