[PATCH] T3/T4 sparc shifts, plus more timings
davem at davemloft.net
Tue Mar 26 20:39:24 CET 2013
From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 26 Mar 2013 14:28:34 +0100
> sllx %l0, cnt, %l2
> or %l4, %l3, %l6
> srlx %l0, tcnt, %l4
> ldx [up+0], %l0
> stx %l6, [rp+0]
> sllx %l1, cnt, %l3
> or %l5, %l2, %l7
> srlx %l1, tcnt, %l5
> ldx [up+8], %l1
> stx %l7, [rp+8]
> add up, 16, up
> add rp, 16, rp
> brgz n, L(top)
> add n, -2, n
> This should be a semi-working loop.
So I've distilled your instructions into the following loop:
or %g4, %g1, %l1
sllx %g2, cnt, %g1
srlx %g2, tcnt, %g4
ldx [up - 8], %g2
stx %l1, [rp - 8]
or %g3, %l2, %l7
sllx %g5, cnt, %l2
srlx %g5, tcnt, %g3
ldx [up - 16], %g5
sub up, 16, up
stx %l7, [rp - 16]
sub rp, 16, rp
brgz n, L(top)
add n, -2, n
I verified quickly with an ad-hoc test program that each iteration
executes in 7 cycles, which is 3.5 cycles per limb.
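
For readers who don't speak SPARC, here is a rough C model of what a loop like this computes. It is only a sketch under assumptions: I'm assuming the loop is the core of an mpn_lshift-style left shift that walks from the most significant limb downward, with 64-bit limbs and 0 < cnt < 64. The names lshift_ref and lshift_unroll2 are made up for illustration and are not GMP functions, and the odd/even peel below merely stands in for the feed-in code, which isn't written yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain reference: shift the n-limb number at up[] left by cnt bits,
   store the low n limbs at rp[], return the bits shifted out at the top.
   Hypothetical helper, not part of GMP. */
uint64_t lshift_ref(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tcnt = 64 - cnt;            /* complementary shift count */
    uint64_t retval = up[n - 1] >> tcnt;
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tcnt);
    rp[0] = up[0] << cnt;
    return retval;
}

/* Same result, but two limbs per loop iteration, moving downward in
   memory like the assembly loop.  Each stored limb is the OR of an
   sllx-style left shift and an srlx-style right shift of the limb below. */
uint64_t lshift_unroll2(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tcnt = 64 - cnt;
    size_t i = n - 1;                    /* index of the limb held in 'hi' */
    uint64_t limb = up[i];
    uint64_t retval = limb >> tcnt;
    uint64_t hi = limb << cnt;           /* high part of rp[i], not yet stored */

    if (i & 1) {                         /* odd count below: peel one limb */
        limb = up[--i];
        rp[i + 1] = hi | (limb >> tcnt);
        hi = limb << cnt;
    }
    while (i >= 2) {                     /* 2-way unrolled main loop */
        uint64_t a = up[i - 1];
        uint64_t b = up[i - 2];
        rp[i] = hi | (a >> tcnt);
        rp[i - 1] = (a << cnt) | (b >> tcnt);
        hi = b << cnt;
        i -= 2;
    }
    rp[0] = hi;                          /* lowest limb gets zeros shifted in */
    return retval;
}
```

The C version makes no attempt to reproduce the instruction scheduling. It only shows the data flow: per limb there is one load, two shifts, one OR and one store, which is what the loop body above does twice per iteration.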
The only disappointment is that we are just one register shy of being
able to avoid allocating a register window. :-/ I was initially
thinking that I could avoid the register window if we get rid of 'n'
and just use a comparison against 'up' or 'rp' as the loop condition,
but that doesn't work since we need an 'up_base' or similar in another
register for the comparison, nullifying our gains.
Anyways, I think this loop above is optimal for 2-way unrolling on T4
since the chip can only retire up to 2 instructions per cycle. The
loop is 14 instructions, so 7 cycles per iteration is the floor.
I'm half-way done with the feed-in and wind-down code and once I'm done
I'll post what I come up with.