div_qr_1 interface

Torbjorn Granlund tg at gmplib.org
Mon Oct 21 14:49:40 CEST 2013

I looked at the logic following this:

        sbb     U2, U2          C 7 13

You negate the U2 copy in Q2.  It seems that replacing three adc with
sbb could avoid the neg.
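
To illustrate the arithmetic behind that suggestion, here is a minimal C
sketch.  It models only the 64-bit result of the instructions, not the real
div_qr_1 code; the function names are made up for illustration.  After
"sbb r, r" the register holds 0 or all ones, so adding its negation gives
the same value as subtracting it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models `sbb r, r`: 0 if the carry flag is clear, all ones if it is set. */
static uint64_t sbb_self(unsigned carry)
{
    assert(carry <= 1u);                /* a flag is either 0 or 1 */
    return (uint64_t)0 - (uint64_t)carry;
}

/* Shape with neg: negate the mask copy, then add it. */
static uint64_t add_negated_mask(uint64_t x, uint64_t mask)
{
    uint64_t q2 = (uint64_t)0 - mask;   /* neg: all ones -> 1, 0 -> 0 */
    return x + q2;
}

/* Shape without neg: subtract the mask itself. */
static uint64_t sub_mask(uint64_t x, uint64_t mask)
{
    return x - mask;
}

/* Returns 1 if both shapes give the same 64-bit result on a few samples. */
static int variants_agree(void)
{
    static const uint64_t samples[] = { 0, 1, 12345, UINT64_MAX - 1, UINT64_MAX };
    for (unsigned carry = 0; carry <= 1; carry++) {
        uint64_t mask = sbb_self(carry);
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            if (add_negated_mask(samples[i], mask) != sub_mask(samples[i], mask))
                return 0;
        }
    }
    return 1;
}
```

The sketch says nothing about the flags: the carry out of an add and the
borrow out of a subtract are not the same, so any instruction later in the
chain that consumes the flag has to change along with it.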

It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only sandybridge, ivybridge, and
haswell run that well.)
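
As a rough picture of the "and" versus cmov idea, the C sketch below shows
the two ways of selecting a value under a carry-derived condition.  The
names are hypothetical and this is not the loop code itself; whether a
compiler actually emits cmov for the second function is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Turns a carry flag (0 or 1) into a mask of 0 or all ones, as `sbb r, r` does. */
static uint64_t carry_to_mask(unsigned carry)
{
    assert(carry <= 1u);
    return (uint64_t)0 - (uint64_t)carry;
}

/* "and" form: build a mask first, then mask the value. */
static uint64_t select_with_and(uint64_t value, unsigned carry)
{
    return value & carry_to_mask(carry);
}

/* cmov form: start from zero and conditionally move the value in. */
static uint64_t select_with_cmov(uint64_t value, unsigned carry)
{
    uint64_t result = 0;
    if (carry)
        result = value;     /* the step a cmovc would perform */
    return result;
}
```

Both functions return the value when the carry is set and zero otherwise;
the cmov form needs no separate mask register.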

  But one variable must be moved out of the registers. Maybe B2md (used
  once) is the best candidate. Then
  	lea	(U0, B2md), U1O
  would have to be replaced by
  	mov	(%rsp), U1O	C Can be done very early
  	add	U0, U1O
  We then have 26 instructions + loop overhead, or 54 instructions for 2
  iterations. Or possibly DINV, if one thinks the quotient logic is less
Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two uses?)

