Sandybridge addmul_N challenge
Niels Möller
nisse at lysator.liu.se
Wed Feb 22 11:18:52 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> But our multiplication loops for Sandybridge are much worse. The
> current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l. I
> have new code running at 3.4 c/l and 3.3 c/l respectively.
Let me think aloud for a bit.
I'm still not sure what the addmul_1 bottleneck is. I note that in the
general x86_64 and core2 code, for each mul, we do two *dependent* adds
before storing the low limb. With the cin and cout for the carrry limbs,
the operations per limb are basically
mov (up), %rax
mul v
add %rax, cin
xor cout, cout
adc $0, cout
add cin, (rp)
adc %rdx, cout
The real code is a lot more clever with unrolling and scheduling, but if
I read it correctly, these are the operations done.
One could reorder this as follows,
mov (up), %rax
mul v
xor cout, cout
add (rp), cin
adc $0, cout
add %rax, cin
adc %rdx, cout
mov cin, (rp)
Then we get a bit more independence of operations, since the first add
is independent of the result of the mul. It costs an instruction with
the extra mov to rp. But I suspect that's still better than updating
memory twice, like
mov (up), %rax
mul v
xor cout, cout
add cin, (rp)
adc $0, cout
add %rax, (rp)
adc %rdx, cout
The recurrency depth seems to be the same in all cases, though, with a
latency of add + adc + adc from cin to cout. If that's what's killing
performance, maybe this would be better,
mov (up), %rax
mul v
xor cout, cout
add (rp), %rax
adc %rdx, cout
add %rax, cin
adc $0, cout
mov cin, (rp)
with a recurrency latency of only add + adc (where the adc in question
has a $0 source operand). If I understand you correctly, that would be
only two cycles on sandybridge. It seems the current sbr code does
something similar? Then we have to rely on scheduling or out-of-order
execution to not get a useless wait between the mul and the first add.
Is instruction issue still limited to three instructions per cycle? Then
that's a more narrow bottleneck than both latency and multiplication
throughput. The above is eight instructions per limb, excluding looping.
So one could at least *hope* to get down to 3 cycles with a moderate
level of unrolling.
Maybe one could try a variant updating (rp) twice, to save an
instruction:
mov (up), %rax
mul v
xor cout, cout
add %rax, (rp)
adc %rdx, cout
add cin, (rp)
adc $0, cout
Is it possible to squeeze those three memory instructions in less than 3
cycles?
If we try to get the instruction count below 9 (like above), then
there's not much room for moving around %rax and %rdx. But the critical
recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
leaving a bit freedom to move it between iterations if we unroll and use
multiple registers for cin and cout. Hmm, let me give one more variant,
which moves %rax and %rdx out of the way, at the cost of one more
instruction (8 instructions per limb):
mov (up), %rax
mul v
mov %rax, pl
mov %rdx, cout
add pl, (rp)
adc $0, cout
add cin, (rp)
adc $0, cout
I guess one ought to bring out the loop mixer to find out if any of this
really can run at 3 cycles.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list