Sandybridge addmul_N challenge
tg at gmplib.org
Thu Feb 23 21:38:27 CET 2012
nisse at lysator.liu.se (Niels Möller) writes:
For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
add c0, r0 C 0 6
adc c1, r1 C 1 7
adc $0, r2 C 3 9
add r1, c2 C 3 9
adc r2, c0 C 4 10
adc $0, c1 C 6 12
So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster then 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.
I am playing with this block:
carry-in lo in r14
carry-in hi in rcx
mov 0(up), %rax
mov 8(rp), %r8
add %rax, %r8
mov %rdx, %r9
adc $0, %r9
mov 8(up), %rax
add %rax, %r8
adc %rdx, %r9
mov $0, R32(%rbx)
adc $0, R32(%rbx)
add %r14, %r8 C 0
adc %rcx, %r9 C 1
adc $0, R32(%rbx) C might be removed
mov %r8, 8(rp)
carry-out lo in r9
carry-out hi in rbx
This is not identical to your block, I think. It runs at exactly 3 c/l.
The recurrency path is extremely shallow, at 1.5 c/l.
If we slightly restrict the operand range, we could remove the indicated
carry propagation insn. Then the code runs at 2.8 c/l.
Neither is decoding bandwidth limited,
It is further possible to supplant the 'mov $0,reg' and following 'adc
$0,reg' with 'setc reg'. This creates a false dependency (on the upper
56 bits) and seems to run at about the same speed.
The plain code (i.e., the code which runs at 3.0 c/l) runs at 3-epsilon
if the lea pointer update insns are removed. This is a good sign,
proving there is no magic stopping us at 3 c/l...
> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.
If we could get there, the above loop should run just below 3 c/l.
I was obviously wrong. :-(
Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.
I believe that insn is used for 32-bit builds, where it helps. Much
improvments could be done for 32-bit builds, if one care
(I see a new mail has arrived, will read now.)
More information about the gmp-devel