[PATCH] T3/T4 sparc shifts, plus more timings

Thu Mar 28 10:26:51 CET 2013

David Miller <davem at davemloft.net> writes:

  > A clever trick!  But you will probably get 2.75 c/l for 4-way, not 2.5
  > c/l.  We'll need infinite unrolling for 2.5...

  Hmmm, not sure I understand.  But anyways, it turns out we don't need to
  use the fused compare-and-branch, here is what the 2-way loop looks like:

There are 5 insns per limb, right?  Then 2 bookkeping insn per
iteration.  So, for k-way unrolling, we have 5k+2 insn, and with 2-way
sustained execution we will get (5k+2)/k/2 c/l for k-way unrolling.

 k     c/l
 1     3.50
 2     3.00
 3     2.83
 4     2.75
 5     2.70
 6     2.67
 7     2.64
 8     2.62

  1:
          sllx    u0, cnt, %l2
          or      %l4, %l3, r0

          ldx     [n + u_off0], u0
          srlx    u1, tcnt, %l5

          stx     r0, [n + r_off0]
          sllx    u1, cnt, %l3

          or      %l2, %l5, r1
          ldx     [n + u_off1], u1

          subcc   n, (2 * 8), n
          srlx    u0, tcnt, %l4

          bge,pt  %xcc, 1b
           stx    r1, [n + r_off0]

  Which is 6 cycles per loop, 3 cycles per limb.

Not bad for a tiny loop!

  On UltraSparc-1 et al. we could shuffle things around and get it
  to schedule like this:

  1:
          sllx    u0, cnt, %l2
          or      %l4, %l3, r0
          ldx     [n + u_off0], u0

          srlx    u1, tcnt, %l5
          stx     r0, [n + r_off0]
          subcc   n, (2 * 8), n

          sllx    u1, cnt, %l3
          or      %l2, %l5, r1
          ldx     [n + u_off1], u1

          srlx    u0, tcnt, %l4
          bge,pt  %xcc, 1b
           stx    r1, [n + r_off0]

  The usual rules about these older chips applies, one shift per group,
  when there are multiple integer instructions in a group the shift
  must appear first.  Loads have a 2 cycle latency, and the above loop
  places the read dependency 3 cycles after the load.

  Unless I've made a mistake that's 4 cycles per loop, 2 cycles per
  limb.  That's what the current 4-way unrolling code obtains.  In fact
  I think that a 4-way unrolling using the above bookkeeping
  simplification wouldn't execute any faster, because we'll have a
  troublesome odd instruction at the end of the loop.

Replacing the old sparc64 code with smaller, faster code would be nice.

I suppose the new T3/T4 addmul_2 and mul_2 code could also get a cycle
shaved off with the multi-pointer trick.  Actually, I think one could do
three things with it:

1. Schedule mulx,umulhi better (if they really have a 12 cycle latency)
2. Use the multi-pointer trick (saves 2 insn/iteration)
3. Use a less regular structure to save some carry propagations (saves 2
   insn/iteration)

I believe 4-way addmul_2 could be made to run at 3.25 c/l, while mul_2
could run at 2.25 c/l.  And while GMP's configure will now have made
addmul_2 the main workhorse for mul_basecase and sqr_basecase, we should
also improve addmul_1.  A 4-way addmul_1 will need 31 insn and should
run at 4 c/l.  A 4-way mul_1 will need 22 insns, or 2.27 c/l.

-- 
Torbjörn