[PATCH] T3/T4 sparc shifts, plus more timings

Tue Mar 26 21:18:26 CET 2013

David Miller <davem at davemloft.net> writes:

  L(top):
          or      %g4, %g1, %l1
          sllx    %g2, cnt, %g1

          srlx    %g2, tcnt, %g4
          ldx     [up - 8], %g2

          stx     %l1, [rp - 8]
          or      %g3, %l2, %l7

          sllx    %g5, cnt, %l2
          srlx    %g5, tcnt, %g3

          ldx     [up - 16], %g5
          sub     up, 16, up

          stx     %l7, [rp - 16]
          sub     rp, 16, rp

          brgz    n, L(top)
           add    n, -2, n

It has lost some symmetry, which would be nice to keep.  Is it slower
in the operation order I suggested?

The advantage of symmetry is readability, and that it becomes quite
simple to add two more ways, for 4-way unrolling.

A structure like

BLOCK(0)
BLOCK(1)
...
bookkeeping

where BLOCK(k) and BLOCK(l) are similar, just with offsets and regnums
replaced.
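
To make the shape concrete, here is a rough C model of a 2-way unrolled
left shift written in that BLOCK(k) style.  It is only a sketch of the
structure, not the sparc code: the names lshift_model and limb_t are made
up for illustration, it assumes 64-bit limbs, n >= 1 and 1 <= cnt <= 63,
and it says nothing about insn scheduling.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name; assumes 64-bit limbs */

/* One "way": read the next lower limb, store one result limb, and keep
   the shifted-up part for the next block.  Only the offset k differs
   between blocks.  */
#define BLOCK(k)                               \
  do {                                         \
    limb_t lo = up[i - 1 - (k)];               \
    rp[i - (k)] = hi | (lo >> tcnt);           \
    hi = lo << cnt;                            \
  } while (0)

/* Shift {up,n} left by cnt bits into {rp,n}, working from the most
   significant limb down; returns the bits shifted out at the top.
   Assumes n >= 1 and 1 <= cnt <= 63.  */
limb_t
lshift_model (limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
  unsigned tcnt = 64 - cnt;
  size_t i = n - 1;
  limb_t hi = up[i] << cnt;
  limb_t ret = up[i] >> tcnt;

  if (i & 1)            /* odd number of steps: peel one off first */
    {
      BLOCK (0);
      i -= 1;
    }

  while (i >= 2)
    {
      BLOCK (0);
      BLOCK (1);
      i -= 2;           /* bookkeeping */
    }

  rp[0] = hi;
  return ret;
}
```

Going to 4-way in this model means adding BLOCK(2) and BLOCK(3), changing
the bookkeeping step to 4, and peeling up to three steps before the loop.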

I realise that the blocks here have an odd number of insns, meaning that
they won't issue identically with that layout.  That might make no
internal BLOCK layout optimal.  We could fix that (with 4-way
unrolling) by giving each block one overhead insn, with BLOCK(3) getting
the branch and putting its final insn in the delay slot.

  And verified quickly using an ad-hoc test program that each iteration
  executes in 7 cycles, which is 3.5 cycles per limb.

5 cycles for actual work, 2 cycles for bookkeeping, i.e., bookkeeping
adds 40%.  Any unrolling will need 2 cycles for bookkeeping, but we can
get more work done with 4-way or even 8-way unrolling.
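
As a back-of-envelope check of that arithmetic, the small C function
below assumes the work scales linearly at 5 cycles per 2 limbs and that
bookkeeping stays at 2 cycles per iteration whatever the unrolling.
Both are assumptions taken from the figures above, not measurements, and
the function name is made up.

```c
/* Estimated cycles per limb for a `ways`-way unrolled loop, assuming
   2.5 cycles of real work per limb (5 cycles per 2 limbs, as measured
   for the 2-way loop) and a fixed 2 cycles of bookkeeping per
   iteration.  Requires ways >= 1.  */
double
cycles_per_limb (unsigned ways)
{
  const double work_per_limb = 5.0 / 2.0;
  const double bookkeeping = 2.0;

  return (work_per_limb * ways + bookkeeping) / ways;
}
```

Under those assumptions 2-way gives 3.5 c/l, 4-way 3.0 c/l and 8-way
2.75 c/l.  Real numbers could differ if the odd block size disturbs
issue grouping.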

  The only disappointment is that we are just one register shy of being
  able to avoid allocating a register window. :-/ I was initially
  thinking that I could avoid the register window if we get rid of 'n'
  and just use a comparison against 'up' or 'rp' as the loop condition,
  but that doesn't work since we need a 'up_base' or similar in another
  register for the comparison, nullifying our gains.

I used new registers judiciously.  Clearly, the 'or' result registers
could be coalesced.

-- 
Torbjörn