[PATCH] T3/T4 sparc shifts, plus more timings
davem at davemloft.net
Wed Mar 27 21:15:54 CET 2013
From: David Miller <davem at davemloft.net>
Date: Wed, 27 Mar 2013 15:43:54 -0400 (EDT)
> I then used the 4-way ultrasparc1234 code as-is with the fanops
> (carefully) removed, and this executes at 3.0 cycles per limb. The
> main 4-way unrolled loop executes in 12 cycles.
> My suggestion at this point is that we use the ultrasparc1234 code
> with the fanops removed, even on T1/T2 since the decrease in the
> number of bookkeeping operations will help even on those chips.
> Just a note that some of the fanop removals have to be done
> non-trivially since they live in delay slots. In all such cases I
> simply moved the first instruction I could from before the branch into
> the delay slot.
As an aside, I think we can get it down to 2.5 cycles per limb on
T4 with 4-way unrolling, and 3.0 cycles per limb with 2-way unrolling.
The idea is to decrease the bookkeeping instructions by only
maintaining base pointers which do not change, and then we have an
offset which operates as the loop index.
So instead of 'n' we'd have an 'n_off', and then in some local
registers we'd hold:
	l3:	up - 8
	l4:	up - 16
	l5:	rp - 8
	l6:	rp - 16
And then the memory operations would have addresses like:
	[%l3 + n_off]
	[%l4 + n_off]
Finally, for the loop iteration we'll use T4's fused compare
and branch instruction, something like:
	cwbne	n_off, n_off_end, L(top)
	add	up, -16, up
	add	rp, -16, rp
and thus saves us an entire cycle in the loop.
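
To make the indexing idea concrete, here is a rough C sketch, not the
SPARC code under discussion. It assumes 64-bit limbs and a left shift
by cnt bits with 1 <= cnt <= 63. The function names lshift_ptr_bump
and lshift_offset_indexed are made up for illustration. The first
version moves both pointers on every limb, the second keeps 'up' and
'rp' fixed and lets a single 'n_off' serve as both the address offset
and the loop index, ending on a compare against 'n_off_end'. In the
assembly the -8/-16 biases would be folded into the base registers.
The C version just writes n_off and n_off - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-bumping form: 'up' and 'rp' are both modified for every limb,
   in addition to the counter 'n'.  Requires n >= 1, 1 <= cnt <= 63. */
uint64_t lshift_ptr_bump(uint64_t *rp, const uint64_t *up,
                         size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    up += n;                        /* one past the top limb */
    rp += n;
    uint64_t high = *--up;
    uint64_t retval = high >> tnc;  /* bits shifted out of the top */
    while (--n != 0) {
        uint64_t low = *--up;
        *--rp = (high << cnt) | (low >> tnc);
        high = low;
    }
    *--rp = high << cnt;
    return retval;
}

/* Offset-indexed form: the base pointers never change, and n_off is
   the only value updated per limb.  Same preconditions as above. */
uint64_t lshift_offset_indexed(uint64_t *rp, const uint64_t *up,
                               size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    const size_t n_off_end = 0;     /* stop once only limb 0 is left */
    uint64_t retval = up[n - 1] >> tnc;
    for (size_t n_off = n - 1; n_off != n_off_end; n_off--)
        rp[n_off] = (up[n_off] << cnt) | (up[n_off - 1] >> tnc);
    rp[0] = up[0] << cnt;
    return retval;
}
```

Both functions should produce identical limbs and return values. Whether
the second form actually saves a cycle depends on the instruction
scheduling on the chip, which C cannot show.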