Sandybridge addmul_N challenge

Wed Feb 22 16:30:55 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  The recurrency depth seems to be the same in all cases, though, with a
  latency of add + adc + adc from cin to cout. If that's what's killing
  performance, maybe this would be better,

The x86_64/mul_1.asm code has adc+adc with register operands on the
recurrency chain, i.e., 2 cycles on any AMD chip and 4 cycles on any
Intel chip (except Pentium4 which has about 20 cycles).

    mov   (up), %rax
    mul	v
    xor	cout, cout
    add	(rp), %rax
    adc	%rdx, cout
    add	%rax, cin
    adc	$0, cout
    mov	cin, (rp)

  with a recurrency latency of only add + adc (where the adc in question
  has a $0 source operand). If I understand you correctly, that would be
  only two cycles on sandybridge. It seems the current sbr code does
  something similar? Then we have to rely on scheduling or out-of-order
  execution to not get a useless wait between the mul and the first add.

I think your code has indeed a 2c recurrency path.

It is similar to the sandybridge code, except that it turned out to be
better to use adc $0 in both places possible.

  Is instruction issue still limited to three instructions per cycle? Then
  that's a more narrow bottleneck than both latency and multiplication
  throughput. The above is eight instructions per limb, excluding looping.
  So one could at least *hope* to get down to 3 cycles with a moderate
  level of unrolling.

I don't think Intel's processors like "op mem,reg" very much.  I tried
the code above (with the loop mixer) and its runs slower than the code I
checked in (which runs slightly faster than claimed, 3.25 c/l, mot 3.4
c/l).

We still have 3 insn/c, except that some adjacent insn pairs are fused.
I don't know exactly which, but test+jcc and cmp+jcc might be the only
ones.

My new addmul_1 has 38 insn, no fusing expected so 3.17 c/l is a decoder
imposed limit.

  Maybe one could try a variant updating (rp) twice, to save an
  instruction:

    mov   (up), %rax
    mul	v
    xor	cout, cout
    add	%rax, (rp)
    adc	%rdx, cout
    add	cin, (rp)
    adc	$0, cout

  Is it possible to squeeze those three memory instructions in less than 3
  cycles?

Intel processors likes "op reg,mem" even less...

  If we try to get the instruction count below 9 (like above), then
  there's not much room for moving around %rax and %rdx. But the critical
  recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx,
  leaving a bit freedom to move it between iterations if we unroll and use
  multiple registers for cin and cout. Hmm, let me give one more variant,
  which moves %rax and %rdx out of the way, at the cost of one more
  instruction (8 instructions per limb):

    mov   (up), %rax
    mul	v
    mov	%rax, pl
    mov	%rdx, cout
    add	pl, (rp)
    adc	$0, cout
    add	cin, (rp)
    adc	$0, cout

  I guess one ought to bring out the loop mixer to find out if any of this
  really can run at 3 cycles.

The sandybridge machine tom behind shell is waiting for you.  A nehalem
machine is biko* (lots of Xen machines on the same hardware) and
repentium is a Conroe ("core2").

I recently played with karatsuba-based addmul_2 for s390.  For MUL-
challenged machines, that might be an option.  Sandybridge is OK there,
but Bull-dozer is not.  A disadvantage is that such code would be
unsuitable for code which wants to be "side-channel silent".

-- 
Torbjörn