[PATCH] T3/T4 sparc shifts, plus more timings
tg at gmplib.org
Tue Mar 26 21:18:26 CET 2013
David Miller <davem at davemloft.net> writes:
	or	%g4, %g1, %l1
	sllx	%g2, cnt, %g1
	srlx	%g2, tcnt, %g4
	ldx	[up - 8], %g2
	stx	%l1, [rp - 8]
	or	%g3, %l2, %l7
	sllx	%g5, cnt, %l2
	srlx	%g5, tcnt, %g3
	ldx	[up - 16], %g5
	sub	up, 16, up
	stx	%l7, [rp - 16]
	sub	rp, 16, rp
	brgz	n, L(top)
	 add	n, -2, n
It has lost some symmetry, which would be nice to keep. Is it slower
in the operation order I suggested?
The advantage of symmetry is readability, and that it becomes quite
simple to add two more ways, for 4-way unrolling.
A structure like
where BLOCK(k) and BLOCK(l) are similar, just with offsets and regnums
I realise that the blocks here have an odd number of insns, meaning that
they won't issue identically with that layout. That might make no
internal BLOCK layout optimal. We could fix that (with 4-way
unrolling) by giving each block one overhead insn, giving BLOCK(3) the
branch and putting its final insn in the delay slot.
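
To make the symmetric layout concrete, here is a rough C model of a left
shift with 4-way unrolling, where each BLOCK(k) differs only in its offset.
It is only a sketch of the structure, not the SPARC code: the names
lshift_model and limb_t are made up for illustration, the peeling loop is an
assumption about how leftover limbs would be handled, and C of course says
nothing about instruction pairing or the delay slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name for a 64-bit limb */

/* One way of the unrolled loop.  Every block does the same thing at a
   different offset k: load the next lower limb, merge its top bits with the
   bits carried down from the limb above, store, and form the new carry. */
#define BLOCK(k)                                  \
  do {                                            \
    limb_t u = up[i - 1 - (k)];                   \
    rp[i - (k)] = high | (u >> tcnt);             \
    high = u << cnt;                              \
  } while (0)

/* Shift {up, n} left by cnt bits (1 <= cnt <= 63) into {rp, n}, working from
   the most significant limb downwards.  Returns the bits shifted out. */
limb_t lshift_model(limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
  assert(n >= 1 && cnt >= 1 && cnt <= 63);

  unsigned tcnt = 64 - cnt;
  limb_t retval = up[n - 1] >> tcnt;
  limb_t high = up[n - 1] << cnt;
  size_t i = n - 1;                 /* index of the next limb to store */

  /* Peel single limbs until a multiple of 4 remains (assumed wind-down). */
  while (i % 4 != 0) {
    BLOCK(0);
    i--;
  }

  /* Main loop: four identical blocks, then the loop bookkeeping. */
  while (i > 0) {
    BLOCK(0);
    BLOCK(1);
    BLOCK(2);
    BLOCK(3);
    i -= 4;
  }

  rp[0] = high;
  return retval;
}
```

In the assembly version the single `i -= 4` would be spread over the pointer
and counter updates, one overhead insn per block as suggested above.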
And verified quickly using an ad-hoc test program that each iteration
executes in 7 cycles, which is 3.5 cycles per limb.
5 cycles for actual work, 2 cycles for bookkeeping, i.e., bookkeeping
adds 40%. Any unrolling will need 2 cycles for bookkeeping, but we can
get more work done with 4-way or even 8-way unrolling.
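
A back-of-the-envelope model of that, assuming the work stays at 2.5 cycles
per limb (5 cycles for 2 limbs, as measured) and the bookkeeping stays at 2
cycles per iteration regardless of unrolling. The linear scaling of the work
is an assumption; the odd insn count per block could well change how the
insns pair up. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Estimated cycles per limb for `ways`-way unrolling under the assumed
   linear model: fixed bookkeeping plus 2.5 work cycles per limb. */
double cycles_per_limb(int ways)
{
  const double work_per_limb = 2.5;  /* 5 work cycles / 2 limbs (measured) */
  const double bookkeeping = 2.0;    /* per iteration, assumed constant */

  assert(ways > 0);
  return (work_per_limb * ways + bookkeeping) / ways;
}

/* Print the estimate for the unrolling factors discussed in this thread. */
void print_estimates(void)
{
  for (int ways = 2; ways <= 8; ways *= 2)
    printf("%d-way: %.2f c/l\n", ways, cycles_per_limb(ways));
}
```

Under these assumptions the 2-way figure reproduces the measured 3.5 c/l,
4-way comes out at 3.0 c/l and 8-way at 2.75 c/l, so the gain from each
further doubling shrinks quickly.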
The only disappointment is that we are just one register shy of being
able to avoid allocating a register window. :-/ I was initially
thinking that I could avoid the register window if we get rid of 'n'
and just use a comparison against 'up' or 'rp' as the loop condition,
but that doesn't work since we need a 'up_base' or similar in another
register for the comparison, nullifying our gains.
I used new registers judiciously. Clearly, the 'or' result registers
could be coalesced.