Sandybridge addmul_N challenge
Torbjorn Granlund
tg at gmplib.org
Wed Feb 22 16:30:55 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
The recurrency depth seems to be the same in all cases, though, with a
latency of add + adc + adc from cin to cout. If that's what's killing
performance, maybe this would be better,
The x86_64/mul_1.asm code has adc+adc with register operands on the
recurrency chain, i.e., 2 cycles on any AMD chip and 4 cycles on any
Intel chip (except Pentium4 which has about 20 cycles).
mov (up), %rax
mul v
xor cout, cout
add (rp), %rax
adc %rdx, cout
add %rax, cin
adc $0, cout
mov cin, (rp)
with a recurrency latency of only add + adc (where the adc in question
has a $0 source operand). If I understand you correctly, that would be
only two cycles on sandybridge. It seems the current sbr code does
something similar? Then we have to rely on scheduling or out-of-order
execution to not get a useless wait between the mul and the first add.
I think your code has indeed a 2c recurrency path.
It is similar to the sandybridge code, except that it turned out to be
better to use adc $0 in both places possible.
Is instruction issue still limited to three instructions per cycle? Then
that's a more narrow bottleneck than both latency and multiplication
throughput. The above is eight instructions per limb, excluding looping.
So one could at least *hope* to get down to 3 cycles with a moderate
level of unrolling.
I don't think Intel's processors like "op mem,reg" very much. I tried
the code above (with the loop mixer) and its runs slower than the code I
checked in (which runs slightly faster than claimed, 3.25 c/l, mot 3.4
c/l).
We still have 3 insn/c, except that some adjacent insn pairs are fused.
I don't know exactly which, but test+jcc and cmp+jcc might be the only
ones.
My new addmul_1 has 38 insn, no fusing expected so 3.17 c/l is a decoder
imposed limit.
Maybe one could try a variant updating (rp) twice, to save an
instruction:
mov (up), %rax
mul v
xor cout, cout
add %rax, (rp)
adc %rdx, cout
add cin, (rp)
adc $0, cout
Is it possible to squeeze those three memory instructions in less than 3
cycles?
Intel processors likes "op reg,mem" even less...
If we try to get the instruction count below 9 (like above), then
there's not much room for moving around %rax and %rdx. But the critical
recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
leaving a bit freedom to move it between iterations if we unroll and use
multiple registers for cin and cout. Hmm, let me give one more variant,
which moves %rax and %rdx out of the way, at the cost of one more
instruction (8 instructions per limb):
mov (up), %rax
mul v
mov %rax, pl
mov %rdx, cout
add pl, (rp)
adc $0, cout
add cin, (rp)
adc $0, cout
I guess one ought to bring out the loop mixer to find out if any of this
really can run at 3 cycles.
The sandybridge machine tom behind shell is waiting for you. A nehalem
machine is biko* (lots of Xen machines on the same hardware) and
repentium is a Conroe ("core2").
I recently played with karatsuba-based addmul_2 for s390. For MUL-
challenged machines, that might be an option. Sandybridge is OK there,
but Bull-dozer is not. A disadvantage is that such code would be
unsuitable for code which wants to be "side-channel silent".
--
Torbjörn
More information about the gmp-devel
mailing list