div_qr_1 interface
Torbjorn Granlund
tg at gmplib.org
Mon Oct 21 14:49:40 CEST 2013
I looked at the logic following this:
	sbb	U2, U2			C 7 13
You negate the U2 copy in Q2. It seems that replacing three adc by sbb
could avoid the neg.
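
To illustrate the identity behind this suggestion, here is a minimal
sketch in C rather than in the actual assembly. The function names are
made up for illustration and are not part of div_qr_1. It models only
the result values: "sbb r, r" leaves a mask of 0 or all ones, and adding
the negated mask gives the same value as subtracting the mask itself.
The carry/borrow flags produced by add and sub differ, so a real
adc-to-sbb rewrite would still need its flag chain checked by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "sbb r, r": 0 when carry is clear, all ones when carry is set. */
uint64_t sbb_self(unsigned carry)
{
    return (uint64_t)0 - (uint64_t)(carry & 1);
}

/* Variant with neg: q2 = -mask is 0 or 1, and is then added to x. */
uint64_t add_negated_mask(uint64_t x, uint64_t mask)
{
    uint64_t q2 = (uint64_t)0 - mask;   /* the neg we would like to avoid */
    return x + q2;
}

/* Variant without neg: subtract the 0/-1 mask directly. */
uint64_t sub_mask(uint64_t x, uint64_t mask)
{
    return x - mask;
}

/* Small self-check: both variants agree for either carry value. */
void check_mask_identity(void)
{
    for (unsigned carry = 0; carry <= 1; carry++) {
        uint64_t mask = sbb_self(carry);
        assert(add_negated_mask(41, mask) == sub_mask(41, mask));
    }
}
```
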
It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips. (IIRC, only Sandy Bridge, Ivy Bridge, and
Haswell run that well.)
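
As a sketch of the and/cmov equivalence, again in C with hypothetical
names: I am assuming here that the "and" stuff is the usual pattern of
masking a value with a 0/-1 mask, which may not match the actual loop
exactly. The conditional form is the kind of expression a compiler may
turn into cmov, though that is not guaranteed.

```c
#include <stdint.h>

/* Mask form: mask is 0 or all ones (e.g. produced by "sbb r, r"). */
uint64_t select_and(uint64_t mask, uint64_t v)
{
    return mask & v;
}

/* Conditional form: yields v when cond is nonzero, else 0. */
uint64_t select_cond(unsigned cond, uint64_t v)
{
    return cond ? v : 0;
}
```

The two forms return the same value whenever mask is all ones exactly
when cond is nonzero. The conditional form needs no mask register, but
it does depend on the flags still being live at that point.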
But one variable must be moved out of the registers. Maybe B2md (used
once) is the best candidate. Then
	lea	(U0, B2md), U1O
would have to be replaced by
	mov	(%rsp), U1O		C Can be done very early
	...
	add	U0, U1O
We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.
Reading from a stack slot costs nothing under ideal circumstances.
To optimise register usage, I sometimes annotate the code with live
ranges for each register. That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two uses?)
--
Torbjörn