div_qr_1 interface
Niels Möller
nisse at lysator.liu.se
Mon Oct 21 13:56:19 CEST 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> On Intel chips, op-to-mem is expensive. Even op-from-memory is often
> slower than load+op. (I understand the register shortage problem.)
The following (untested) variant needs one register too many.
UP, QP, UN: Load, store, loop counter.
DINV, B2, B2md: Loop invariant constants.
U2, U1I, U0, Q1I, Q0: Inputs.
U1O, Q1O: Outputs.
Q2, %rax, %rdx: Locals.
Also U1I -> U1O recurrency chain (with opteron cycle counts)
mov U2, Q2
mov U2, Q1O
neg Q2
mov DINV, %rax
and DINV, Q1O
mul U1I
add Q0, Q1O
adc $0, Q2
mov %rax, Q0
add %rdx, Q1O
adc $0, Q2
mov B2, %rax
and B2, U2
mul U1I C 0 6
lea (U0, B2md), U1O
add U0, U2
cmovnc U2, U1O
adc U1I, Q1O
adc Q1I, Q2
mov Q2, (QP, UN, 8)
jc L(incr)
L(incr_done):
mov (UP, UN, 8), U0
add %rax, U0 C 4
adc %rdx, U1O C 5
sbb U2, U2 C 6
25 instructions (27 K10 decoder slots) excluding loop overhead.
But one variable must be moved out of the registers. Maybe B2md (used
once) is the best candidate. Then
lea (U0, B2md), U1O
would have to be replaced by
mov (%rsp), U1O C Can be done very early
...
add U0, U1O
We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list