div_qr_1 interface

Mon Oct 21 14:49:40 CEST 2013

I looked at the logic following this:

        sbb     U2, U2          C 7 13

You negate the U2 copy in Q2.  It seems that replacing three adc by sbb
could avoid the neg.
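
The arithmetic behind that suggestion, as a small C sketch.  It models
only the values, not the flags or the actual register assignment in the
loop; the function names are made up for illustration.  After sbb r,r
the register holds 0 or all ones, so subtracting that mask gives the
same result as negating it and then adding.

```c
#include <assert.h>
#include <stdint.h>

/* Current shape: sbb r,r yields a mask (0 or ~0), neg turns it into
   0 or 1, and that bit is then added. */
static uint64_t add_carry_via_neg(uint64_t q, unsigned cy)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(cy != 0); /* models sbb r,r */
    uint64_t bit  = (uint64_t)0 - mask;                /* models neg     */
    return q + bit;                                    /* add side       */
}

/* Suggested shape: skip the neg and subtract the mask directly,
   since q - (-1) == q + 1 modulo 2^64. */
static uint64_t add_carry_via_sub(uint64_t q, unsigned cy)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(cy != 0); /* models sbb r,r */
    return q - mask;                                   /* sub side       */
}
```

Both functions agree for every q and either carry value, including the
wrap-around case q = 2^64 - 1.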

It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only sandybridge, ivybridge, and
haswell run that well.)
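
A rough C sketch of the and-versus-cmov idea, assuming the "and" stuff
is the usual pattern of masking an operand with a 0 / all-ones value
before adding it.  That is my assumption about the loop, and the names
below are hypothetical.  Whether the conditional form really compiles to
cmov depends on the compiler; in hand-written assembly one would of
course use cmov directly.

```c
#include <assert.h>
#include <stdint.h>

/* Mask form: build 0 or ~0 from the condition, then and + add. */
static uint64_t add_if_mask(uint64_t x, uint64_t b, unsigned cy)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(cy != 0); /* 0 or all ones */
    return x + (b & mask);
}

/* Select form: pick b or 0, then add.  Compilers often emit cmov for
   this, but that is not guaranteed. */
static uint64_t add_if_select(uint64_t x, uint64_t b, unsigned cy)
{
    uint64_t t = cy ? b : 0;
    return x + t;
}
```

The two forms give identical results; the difference is only in which
instructions end up on the dependency chain.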

  But one variable must be moved out of the registers. Maybe B2md (used
  once) is the best candidate. Then

  	lea	(U0, B2md), U1O

  would have to be replaced by

  	mov	(%rsp), U1O	C Can be done very early
          ...
          add	U0, U1O

  We then have 26 instructions + loop overhead, or 54 instructions for 2
  iterations. Or possibly DINV, if one thinks the quotient logic is less
  critical.

Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two usages?)

-- 
Torbjörn