[PATCH] T3/T4 sparc shifts, plus more timings

Wed Mar 27 21:15:54 CET 2013

From: David Miller <davem at davemloft.net>
Date: Wed, 27 Mar 2013 15:43:54 -0400 (EDT)

> I then used the 4-way ultrasparc1234 code as-is with the fanops
> (carefully) removed, and this executes at 3.0 cycles per limb.  The
> main 4-way unrolled loop executes in 12 cycles.
> 
> My suggestion at this point is that we use the ultrasparc1234 code
> with the fanops removed, even on T1/T2 since the decrease in the
> number of bookkeeping operations will help even on those chips.
> 
> Just a note that some of the fanop removals have to be done
> non-trivially since they live in delay slots.  In all such cases I
> simply moved the first instruction I could from before the branch into
> the delay slot.

As an aside I think we can get it down to 2.5 cycles per limb on
T4 with 4-way unrolling, and 3.0 cycles per limb with 2-way
unrolling.

The idea is to cut down on the bookkeeping instructions by maintaining
only base pointers which do not change, plus a single offset which
operates as the loop index.

So instead of 'n' we'd have an 'n_off', and then in some local
registers we'd hold:

%l3:	up - 8
%l4:	up - 16
%l5:	rp - 8
%l6:	rp - 16

And then the memory operations would have addresses like:

	[%l3 + n_off]
	[%l4 + n_off]

etc.

Finally, for the loop iteration we'll use T4's fused compare
and branch instruction, something like:

	cwbne	n_off, n_off_end, L(top)

This kills:

	add	up, -16, up
	add	rp, -16, rp

and thus saves us an entire cycle in the loop.
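
To make the addressing scheme concrete, here is a rough sketch in C
rather than SPARC assembly. It is only an illustration, not the
proposed patch: the function name, the 2-way unrolling and the 64-bit
limb type are assumptions made for the example. The base pointers up
and rp are never modified; the single index i plays the role of n_off,
so up[i - 1] and up[i - 2] correspond to [(up - 8) + n_off] and
[(up - 16) + n_off], and the loop test is one compare of the index
against its end value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift the n-limb number at up left by cnt bits
   (1 <= cnt <= 63), store the result at rp, and return the bits that
   were shifted out of the most significant limb. */
uint64_t lshift_offset_indexed(uint64_t *rp, const uint64_t *up,
                               size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);

    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t retval = high >> tnc;
    size_t i = n - 1;               /* the only value updated per iteration */

    /* 2-way unrolled main loop, walking from the high limbs down.
       up and rp stay fixed; every address is base + constant + i. */
    while (i >= 2) {
        uint64_t low1 = up[i - 1];  /* like [(up - 8)  + n_off] */
        uint64_t low2 = up[i - 2];  /* like [(up - 16) + n_off] */
        rp[i]     = (high << cnt) | (low1 >> tnc);
        rp[i - 1] = (low1 << cnt) | (low2 >> tnc);
        high = low2;
        i -= 2;                     /* one update, then compare and branch */
    }

    /* Wind down: at most one limb pair is left. */
    if (i == 1) {
        uint64_t low = up[0];
        rp[1] = (high << cnt) | (low >> tnc);
        high = low;
    }
    rp[0] = high << cnt;
    return retval;
}
```

Whether a C compiler actually emits register plus register addressing
for this is up to the compiler; the point is only that the loop body
carries one index update instead of one pointer update per operand.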