addmul_k with toom

Mon Jun 19 13:20:56 UTC 2017

tg at gmplib.org (Torbjörn Granlund) writes:

  I timed pmuludq now, and their throughput is 1 every 2nd cycle, i.e.,
  one *should* be able to beat 54 c/l for addmul_2 quite soundly.  4
  pmuludq plus some shuffling will give a 64-bit addmul_2.

Some more experiments later.

Two pipeline quirks stand in the way of good atom64 SSE2 performance:

1. The 2 x 64-bit paddq instruction has a throughput of one per 5
cycles and a latency of 5.  (This is in stark contrast to the 4 x 32-bit
paddd, which has a throughput of 2 per cycle and a latency of 1.)

2. The movq xmm => greg instruction is also really slow.

To work around these limitations, one needs to use movdqa via the stack
and then perform a 3-way add in the general registers.  I haven't worked
out the finer details, but the instruction mix seems to run at ~10 c/l
for addmul_1.  (This does not translate to a fast atom32 addmul_2 as it
depends on 64-bit add/adc, but even using twice as many 32-bit add/adc,
this should be an atom32 improvement.)
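
For reference, here is a rough portable C sketch of the arithmetic
involved: a 64x64 => 128-bit limb product assembled from four 32x32 =>
64-bit products (the operation one pmuludq lane performs), with the
column sums done in ordinary 64-bit integers.  The names mul32 and
mul64_from_32 are made up for illustration, and this says nothing about
how the products would be scheduled or laid out in the real loop.

```c
#include <stdint.h>

/* One pmuludq lane: unsigned 32x32 -> 64-bit product. */
static uint64_t mul32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* 64x64 -> 128-bit product built from four 32x32 partial products. */
static void mul64_from_32(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    uint32_t u0 = (uint32_t)u, u1 = (uint32_t)(u >> 32);
    uint32_t v0 = (uint32_t)v, v1 = (uint32_t)(v >> 32);

    uint64_t p00 = mul32(u0, v0);   /* weight 2^0  */
    uint64_t p01 = mul32(u0, v1);   /* weight 2^32 */
    uint64_t p10 = mul32(u1, v0);   /* weight 2^32 */
    uint64_t p11 = mul32(u1, v1);   /* weight 2^64 */

    /* Middle column: three 32-bit values, so the sum fits in 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    /* Cannot overflow, since the full product is below 2^128. */
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```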

The loop I have in mind with just the SSE instructions:

1:
        punpckldq (%rsi), %xmm1
        pmuludq   %xmm7, %xmm0
        movdqa    %xmm2, 0(%rbp)

        punpckhdq (%rsi), %xmm2
        pmuludq   %xmm7, %xmm1
        movdqa    %xmm3, 16(%rbp)

        punpckldq 16(%rsi), %xmm3
        pmuludq   %xmm7, %xmm2
        movdqa    %xmm0, 32(%rbp)

        punpckhdq 16(%rsi), %xmm0
        pmuludq   %xmm7, %xmm3
        movdqa    %xmm1, 48(%rbp)

        lea       32(%rsi), %rsi
        dec       %rcx
        jnz       1b
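
For readers less familiar with the instructions used, here is a scalar
C model of their lane semantics as I understand them from the Intel
instruction reference.  The xmm and xmm_q types and the function names
are just for illustration; the functions return the new value of the
destination register.  It models single instructions only, not the
loop.

```c
#include <stdint.h>

typedef struct { uint32_t d[4]; } xmm;    /* d[0] = least significant dword */
typedef struct { uint64_t q[2]; } xmm_q;  /* q[0] = least significant qword */

/* punpckldq src, dst (AT&T order): interleave the two low dwords. */
static xmm punpckldq(xmm dst, xmm src)
{
    xmm r = {{ dst.d[0], src.d[0], dst.d[1], src.d[1] }};
    return r;
}

/* punpckhdq src, dst (AT&T order): interleave the two high dwords. */
static xmm punpckhdq(xmm dst, xmm src)
{
    xmm r = {{ dst.d[2], src.d[2], dst.d[3], src.d[3] }};
    return r;
}

/* pmuludq: multiply dwords 0 and 2 of each operand, giving two
   unsigned 64-bit products. */
static xmm_q pmuludq(xmm a, xmm b)
{
    xmm_q r = {{ (uint64_t)a.d[0] * b.d[0], (uint64_t)a.d[2] * b.d[2] }};
    return r;
}
```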

One will need properly scheduled add, adc, adc, mov to complete
addmul_1, or just adc, mov to complete mul_1.
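
As an illustration of that integer tail, here is a C sketch of the
carry chain for addmul_1, assuming the full 128-bit product for each
limb has already been formed and stored as a (lo, hi) pair.  The
prod128 layout and the name addmul_1_finish are hypothetical; the real
code would read partial products from the stack buffer and do this
with add/adc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scratch layout: one 128-bit product per limb. */
typedef struct { uint64_t lo, hi; } prod128;

/* rp[i] += pp[i] with carry propagation; returns the carry-out limb.
   Assumes each pp[i] is a true 64x64 product, so hi <= 2^64 - 2 and
   the carry limb cannot overflow. */
static uint64_t addmul_1_finish(uint64_t *rp, const prod128 *pp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = pp[i].lo + cy;    /* add the incoming carry limb */
        uint64_t c1 = lo < cy;          /* carry out of that add       */
        lo += rp[i];                    /* add the destination limb    */
        uint64_t c2 = lo < rp[i];       /* carry out of that add       */
        rp[i] = lo;                     /* the final mov               */
        cy = pp[i].hi + c1 + c2;        /* the adc steps into the high */
    }
    return cy;
}
```

For mul_1 the add of rp[i] and its carry c2 simply drop out.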

A problem with this code is that punpckhdq/punpckldq require 128-bit
alignment for memory operands, whereas we will only have 64-bit
alignment for atom64.  One will need a conditional initial step as well
as a conditional final step; alternatively, load separately using movdqu
and have punpckhdq/punpckldq read from the loaded register.  A cool
idea: It should be possible to start with either punpckldq or punpckhdq
depending on alignment, simply rounding down the address (using and
$-16, %rsi) before the loop.
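
The address arithmetic for that idea could look roughly like this in
C.  This is only a sketch: align_plan and plan_start are made-up names,
and the input is assumed to be 8-byte aligned, as a limb pointer would
be on atom64.

```c
#include <stdint.h>

typedef struct {
    uintptr_t base;    /* address rounded down to a 16-byte boundary */
    int start_high;    /* nonzero if the first limb is in the upper
                          8 bytes, i.e. start with punpckhdq         */
} align_plan;

/* addr is assumed to be a multiple of 8. */
static align_plan plan_start(uintptr_t addr)
{
    align_plan p;
    p.base = addr & ~(uintptr_t)15;     /* same effect as and $-16 */
    p.start_high = (addr & 8) != 0;     /* bit 3 selects the half  */
    return p;
}
```

An address already on a 16-byte boundary keeps its base and starts with
punpckldq; one that is only 8-byte aligned gets its base lowered by 8
and starts with punpckhdq.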

(There is a false dependency created by punpckhdq/punpckldq; one could
break it by moving some garbage value into the target register
beforehand, but I suspect that's going to be slower.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622