Sandybridge addmul_N challenge
Niels Möller
nisse at lysator.liu.se
Thu Feb 23 21:00:41 CET 2012
Torbjorn Granlund <tg at gmplib.org> writes:
> In loopmixer or manually? I wouldn't draw any conclusions without
> mixing the code first...
With the loop mixer.
> Meaning evaluating in +1 instead of -1, I assume.
Exactly.
> Did you compute the recurrency chain? Annotating the instructions on
> the recurrency chain helps understanding the problem.
I tried. I use this iteration
mov (up, n, 8), %rax
mov %rax, u
mul v0
mov %rax, r0
mov %rdx, r1
mov u, %rax
mul v1
mov (rp, n, 8), t
add t, r0
add %rax, r1
mov %rdx, r2
adc $0, r2
add c0, r0
mov r0, (rp, n, 8)
adc c1, r1
adc $0, r2
For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),
add c0, r0 C 0 6
adc c1, r1 C 1 7
adc $0, r2 C 3 9
add r1, c2 C 3 9
adc r2, c0 C 4 10
adc $0, c1 C 6 12
So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster then 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.
> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.
If we could get there, the above loop should run just below 3 c/l.
> I don't think there are. These instructions are mostly FP plus narrow
> integer ops.
Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list