[PATCH] T3/T4 sparc shifts, plus more timings

David Miller davem at davemloft.net
Tue Mar 26 20:39:24 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 26 Mar 2013 14:28:34 +0100

> L(top):
> 	sllx	%l0, cnt, %l2
> 	or	%l4, %l3, %l6
> 	srlx	%l0, tcnt, %l4
> 	ldx	[up+0], %l0
> 	stx	%l6, [rp+0]
> 	sllx	%l1, cnt, %l3
> 	or	%l5, %l2, %l7
> 	srlx	%l1, tcnt, %l5
> 	ldx	[up+8], %l1
> 	stx	%l7, [rp+8]
> 	add	up, 16, up
> 	add	rp, 16, rp
> 	brgz	n, L(top)
> 	 add	n, -2, n
> This should be a semi-working loop.

So I've distilled your instructions into the following loop:

L(top):
        or      %g4, %g1, %l1
        sllx    %g2, cnt, %g1

        srlx    %g2, tcnt, %g4
        ldx     [up - 8], %g2

        stx     %l1, [rp - 8]
        or      %g3, %l2, %l7

        sllx    %g5, cnt, %l2
        srlx    %g5, tcnt, %g3

        ldx     [up - 16], %g5
        sub     up, 16, up

        stx     %l7, [rp - 16]
        sub     rp, 16, rp

        brgz    n, L(top)
         add    n, -2, n

I quickly verified, using an ad-hoc test program, that each iteration
executes in 7 cycles, which is 3.5 cycles per limb.
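
For readers following the sllx/srlx/or pattern, here is a minimal C
sketch of the operation these shift loops are working towards: a
multi-limb left shift that walks from the most significant limb
downwards, two limbs per pass. It is an illustration only, not GMP's
code and not a translation of the loop above (which is a timing
skeleton). The name `lshift_ref`, the 64-bit limb type, and the
preconditions n >= 1 and 0 < cnt < 64 are assumptions made for this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not GMP code): shift the n-limb number
 * at up[] left by cnt bits, store the result at rp[], and return the
 * bits shifted out of the top limb.
 * Assumed preconditions: n >= 1 and 0 < cnt < 64. */
uint64_t lshift_ref(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt > 0 && cnt < 64);

    unsigned tcnt = 64 - cnt;      /* complementary count, as in the asm */
    uint64_t hi = up[n - 1];
    uint64_t out = hi >> tcnt;     /* bits that fall off the top */
    size_t i = n - 1;

    /* Two result limbs per pass, mirroring the 2-way unrolling. */
    while (i >= 2) {
        uint64_t mid = up[i - 1];
        uint64_t lo = up[i - 2];
        rp[i] = (hi << cnt) | (mid >> tcnt);
        rp[i - 1] = (mid << cnt) | (lo >> tcnt);
        hi = lo;
        i -= 2;
    }

    /* Wind-down: at most one paired limb remains. */
    if (i == 1) {
        uint64_t lo = up[0];
        rp[1] = (hi << cnt) | (lo >> tcnt);
        hi = lo;
    }

    rp[0] = hi << cnt;             /* lowest limb has no lower neighbour */
    return out;
}
```

Each result limb needs one left shift, one right shift, and one OR,
plus a load and a store, which is where the sllx/srlx/or/ldx/stx mix
in the loop comes from.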

The only disappointment is that we are just one register shy of being
able to avoid allocating a register window. :-/ I was initially
thinking that I could avoid the register window if we get rid of 'n'
and just use a comparison against 'up' or 'rp' as the loop condition,
but that doesn't work since we need a 'up_base' or similar in another
register for the comparison, nullifying our gains.

Anyways, I think this loop above is optimal for 2-way unrolling on T4
since the chip can only retire up to 2 instructions per cycle.
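
The arithmetic behind that claim can be spelled out. The loop body
above contains 14 instructions, counting the add in the branch delay
slot. At 2 instructions per cycle, no schedule can finish an
iteration in fewer than 7 cycles, which matches the measured figure.
The helper below is a hypothetical name used only for this note.

```c
/* Hypothetical helper for this note: the fewest cycles in which a loop
 * body of `instructions` instructions can complete when at most
 * `per_cycle` instructions complete per cycle (rounds up). */
unsigned min_cycles(unsigned instructions, unsigned per_cycle)
{
    return (instructions + per_cycle - 1) / per_cycle;
}
```

Here `min_cycles(14, 2)` is 7, and with two limbs handled per
iteration that gives the 3.5 cycles per limb quoted above.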

I'm half-way done with the feed-in and wind-down code, and once that's
finished I'll post what I come up with.

