addmul_k with toom

Sun Jun 18 21:46:02 UTC 2017

nisse at lysator.liu.se (Niels Möller) writes:

  Below is an assembly sketch, not yet working. With some luck, some
  variant of this should run in 54 cycles per iteration on Atom (limited
  by multiplication throughput, each of the three muls taking 18 cycles).
  Not sure how to think about scheduling for this processor (and it's
  little point in thinking too much about it until we have a correct
  instruction sequence).

The original Atom (the one which needs 18 cycles for a 64-bit mul insn)
is a 2-issue in order pipeline.

I am not sure the multiply runs completely unperturbed in the background
on these CPUs.  Some timing experiments suggest that mul stalls the
pipeline, at least partially.

I timed pmuludq now, and their throughput is 1 every 2nd cycle, i.e.,
one *should* be able to beat 54 c/l for addmul_2 quite soundly.  4
pmuludq plus some shuffling will give a 64-bit addmul_2.

SSE4.1 instruction are missing from this old platform, and that will
result in more shuffling/masking insn.  But given that the raw multiply
needs just 8 cycles, I expect 54 c/l will not be hard to beat.  :-)

It might be worth noticing that pmuludq-based multiply loops could be
shared between x86-32 and x86-64.  An 64-bit addmul_(k) will be a 32-bit
addmul_(2k).  At least if we stay with SSE/AVX insns and do not do some
final accumulation using plain integer ops. (The same is not true for
big-endian CPUs.)

I like the idea of a side-channel silent toom addmul_2, but it is not
obvious which platform will benefit.

-- 
Torbjörn
Please encrypt, key id 0xC8601622