Sandybridge addmul_N challenge

Thu Feb 23 21:38:27 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
  Let's write the interesting instructions out and unroll twice (using
  different registers),

          add     c0, r0		C  0  6
          adc     c1, r1		C  1  7
          adc     $0, r2		C  3  9

          add     r1, c2		C  3  9
          adc     r2, c0		C  4  10
          adc     $0, c1		C  6  12

  So the recurrency, for one iteration, seems to be just 3 cycles. But the
  loop mixer doesn't find anything faster then 6.36 cycles for one
  iteration, or 3.18 per limb product. Which isn't too bad (a slight
  improvement over 3.24, which I think is the best reported earlier), but
  stubbornly above 3 c/l.

I am playing with this block:

carry-in lo in r14
carry-in hi in rcx
        mov     0(up), %rax
        mul     v1
        mov     8(rp), %r8
        add     %rax, %r8
        mov     %rdx, %r9
        adc     $0, %r9
        mov     8(up), %rax
        mul     v0
        add     %rax, %r8
        adc     %rdx, %r9
        mov     $0, R32(%rbx)
        adc     $0, R32(%rbx)
        add     %r14, %r8               C 0
        adc     %rcx, %r9               C 1
        adc     $0, R32(%rbx)           C might be removed
        mov     %r8, 8(rp)
carry-out lo in r9
carry-out hi in rbx

This is not identical to your block, I think.  It runs at exactly 3 c/l.
The recurrency path is extremely shallow, at 1.5 c/l.

If we slightly restrict the operand range, we could remove the indicated
carry propagation insn.  Then the code runs at 2.8 c/l.

Neither is decoding bandwidth limited,

It is further possible to supplant the 'mov $0,reg' and following 'adc
$0,reg' with 'setc reg'.  This creates a false dependency (on the upper
56 bits) and seems to run at about the same speed.

The plain code (i.e., the code which runs at 3.0 c/l) runs at 3-epsilon
if the lea pointer update insns are removed.  This is a good sign,
proving there is no magic stopping us at 3 c/l...

  > My experience of Sandybridge is that with load/store coding style, the
  > CPU typically executes 3 insn/cycle unless there is a recurrency
  > dependency stopping that.

  If we could get there, the above loop should run just below 3 c/l.

I was obviously wrong.  :-(

  Hmm. Last time I looked at that was in a 32-bit context. There's a
  32x32->64 instruction which might be useful for a 32-bit build, at least
  in theory, but as far as I can find in the manual, the latest
  ss*-extensions don't provide any wider multiplication than that.

I believe that insn is used for 32-bit builds, where it helps.  Much
improvments could be done for 32-bit builds, if one care

(I see a new mail has arrived, will read now.)

-- 
Torbjörn