addmul_k with toom
Torbjörn Granlund
tg at gmplib.org
Mon Jun 19 13:20:56 UTC 2017
tg at gmplib.org (Torbjörn Granlund) writes:
> I timed pmuludq now, and their throughput is 1 every 2nd cycle, i.e.,
> one *should* be able to beat 54 c/l for addmul_2 quite soundly. 4
> pmuludq plus some shuffling will give a 64-bit addmul_2.
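
As a check on the operation count: each pmuludq performs two 32x32->64
multiplies, so four of them give eight, and a 64x64->128 product needs four.
The following is only a minimal C sketch of that decomposition, not GMP code;
the names mul32x32 and mul64x64 are made up for illustration.

```c
#include <stdint.h>

/* One 32x32->64 product, i.e. what a single pmuludq lane computes. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Hypothetical helper: 64x64->128 product from four 32x32->64 products. */
static void mul64x64(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    uint32_t u0 = (uint32_t)u, u1 = (uint32_t)(u >> 32);
    uint32_t v0 = (uint32_t)v, v1 = (uint32_t)(v >> 32);

    uint64_t p00 = mul32x32(u0, v0);   /* weight 2^0  */
    uint64_t p01 = mul32x32(u0, v1);   /* weight 2^32 */
    uint64_t p10 = mul32x32(u1, v0);   /* weight 2^32 */
    uint64_t p11 = mul32x32(u1, v1);   /* weight 2^64 */

    /* Middle column: three values below 2^32, so the sum fits in 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

The "shuffling" in the SSE version corresponds to splitting the operands into
32-bit halves and recombining the partial products as done by hand here.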
Some more experiments later.
Two pipeline quirks stand in the way of good atom64 SSE2 performance:
1. The 2 x 64-bit paddq instruction has a throughput of 1 per 5 cycles and a
latency of 5 cycles. (This is in stark contrast to the 4 x 32-bit paddd, which
has a throughput of 2 per cycle and a latency of 1 cycle.)
2. The movq xmm => greg instruction is also really slow.
To work around these limitations, one needs to store with movdqa to the stack
and then perform a 3-way add in general registers. I haven't worked
out the finer details, but the instruction mix seems to run at ~10 c/l
for addmul_1. (This does not translate to a fast atom32 addmul_2, as it
depends on 64-bit add/adc, but even with twice as many 32-bit add/adc
this should be an atom32 improvement.)
The loop I have in mind with just the SSE instructions:
1:      punpckldq (%rsi), %xmm1
        pmuludq   %xmm7, %xmm0
        movdqa    %xmm2, 0(%rbp)
        punpckhdq (%rsi), %xmm2
        pmuludq   %xmm7, %xmm1
        movdqa    %xmm3, 16(%rbp)
        punpckldq 16(%rsi), %xmm3
        pmuludq   %xmm7, %xmm2
        movdqa    %xmm0, 32(%rbp)
        punpckhdq 16(%rsi), %xmm0
        pmuludq   %xmm7, %xmm3
        movdqa    %xmm1, 48(%rbp)
        lea       32(%rsi), %rsi
        dec       %rcx
        jnz       1b
One will need properly scheduled add, adc, adc, mov to complete
addmul_1, or just adc, mov to complete mul_1.
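
To make the scalar completion step concrete, here is a rough C sketch of the
3-way add. It assumes a simplified layout that is not in the loop above: the
low and high 64-bit halves of each full limb product are already available in
two arrays (lo and hi), whereas the real code would have to assemble them from
the 32-bit partial products stored on the stack. The function name
addmul_1_finish is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion step for addmul_1.
   lo[i], hi[i] are assumed to be the low and high halves of up[i] * v.
   Adds the products into rp[0..n) and returns the carry-out limb. */
static uint64_t addmul_1_finish(uint64_t *rp, const uint64_t *lo,
                                const uint64_t *hi, size_t n)
{
    uint64_t carry = 0;  /* previous high half plus carries from below */

    for (size_t i = 0; i < n; i++) {
        /* 3-way add: lo[i] + carry + rp[i], tracking both carry-outs. */
        uint64_t s = lo[i] + carry;
        uint64_t c = s < carry;
        uint64_t r = rp[i] + s;
        c += r < s;
        rp[i] = r;
        /* Does not overflow when lo/hi really are halves of a 64x64
           product, since up[i]*v + rp[i] + carry < 2^128. */
        carry = hi[i] + c;
    }
    return carry;
}
```

Dropping the rp[i] term from the sum gives the corresponding mul_1 step.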
A problem with this code is that punpckhdq/punpckldq require 128-bit
alignment of their memory operands, whereas we will only have 64-bit
alignment for atom64. One will need a conditional initial step as well as a
conditional final step, or alternatively load separately using movdqu and have
punpckhdq/punpckldq read from the loaded register. A cool idea: It
should be possible to start with either punpckldq or punpckhdq
depending on alignment, simply rounding down the address (using and $-16,
%rsi) before the loop.
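
The address arithmetic behind that idea can be sketched in C as follows. This
is an illustration only; the function names are made up, and it assumes the
source pointer is at least 8-byte aligned, as limb pointers are on atom64.

```c
#include <stdint.h>

/* Round a pointer down to a 16-byte boundary, like `and $-16, %rsi`. */
static const void *align_down_16(const void *p)
{
    return (const void *)((uintptr_t)p & ~(uintptr_t)15);
}

/* For an 8-byte-aligned pointer: 0 if it points at the low half of its
   16-byte block (start with punpckldq), 1 if it points at the high half
   (start with punpckhdq). */
static int starts_in_high_half(const void *p)
{
    return (int)(((uintptr_t)p >> 3) & 1);
}
```

With the address rounded down, every memory operand in the loop is 16-byte
aligned, and the parity bit only decides which unpack instruction handles the
first limb.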
(There is a false dependency created by punpckhdq/punpckldq; one could
break it by moving some garbage value into the target register beforehand,
but I suspect that's going to be slower.)
--
Torbjörn
Please encrypt, key id 0xC8601622