[PATCH] T3/T4 sparc shifts, plus more timings

Fri Mar 29 04:28:22 CET 2013

David Miller <davem at davemloft.net> writes:

  So here is a working generic 64-bit sparc lshift.asm that seems
  to work well on all chips.

Great!  I'll do a partial check-in of your work thus far, after some
hours of rest.

  I'm now going to iterate over lshiftc and rshift.

Making rshift from lshift (or vice versa) can usually be done almost
mindlessly...
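
To illustrate the mirroring, here is a minimal C sketch, not the sparc
asm under discussion.  It assumes 64-bit limbs and 1 <= cnt <= 63; the
names limb_t, ref_lshift and ref_rshift are hypothetical stand-ins, not
GMP's own.  The right shift is the left shift with the traversal
direction and the two shift directions swapped.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for GMP's mp_limb_t */

/* Shift {src, n} left by cnt bits (1 <= cnt <= 63, n >= 1) into {dst, n}.
   Walks from the most significant limb down.  Returns the bits shifted
   out of the top limb, in the low bits of the return value. */
limb_t ref_lshift(limb_t *dst, const limb_t *src, size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    limb_t high = src[n - 1];
    limb_t ret = high >> tnc;
    for (size_t i = n - 1; i > 0; i--) {
        limb_t low = src[i - 1];
        dst[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    dst[0] = high << cnt;
    return ret;
}

/* Mirror image: walks from the least significant limb up, with << and >>
   swapped.  Returns the bits shifted out of the bottom limb, in the high
   bits of the return value. */
limb_t ref_rshift(limb_t *dst, const limb_t *src, size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    limb_t low = src[0];
    limb_t ret = low << tnc;
    for (size_t i = 0; i + 1 < n; i++) {
        limb_t high = src[i + 1];
        dst[i] = (low >> cnt) | (high << tnc);
        low = high;
    }
    dst[n - 1] = low >> cnt;
    return ret;
}
```

Shifting left and then right by the same count gives back the original
operand whenever the left shift returned 0.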

I noticed nand in "vis".  But it looks like it operates on fp
registers.  And there might not be any useful shift insn in vis.  (We'd
want generic code anyway, of course.)

I put some numbers in () and {} at http://gmplib.org/devel/asm.html for
T4.

With your multi-pointer trick there is a drive towards less unrolling.
I almost always do 4-way, occasionally 8-way.  Clearly, setting up 8 src
pointers and 8 dst pointers would just be silly.  Even 4 + 4 pointers is
a bit expensive.  The most critical routines (in descending order:
addmul_2, addmul_1, mul_2, mul_1, add_n, sub_n, lshift, rshift, ...)
should be whacked closer to optimum, but without forgetting that a big
constant-time setup cost will hurt in many important usage cases.

We should also keep in mind that *mul_(k) are rarely used for large trip
counts.  More than a few dozen limbs will only happen for unbalanced
multiplies (one operand large, the other at most a few dozen limbs).
On the other hand, add_n, sub_n, lshift, rshift, and lshiftc are
sometimes used for operands that don't fit in L1d.

Prefetch will cost a lot for an issue-limited CPU like T4.  To mitigate
that cost, one would want to drive up unrolling to handle a full cache
line, which would be 8-way.  Not pretty.
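
As a rough C illustration of that trade-off (not code from GMP), the
sketch below issues one prefetch per source operand per block of 8
limbs, so the extra instructions are amortized over a whole line.  It
assumes 8-byte limbs and a 64-byte line, as the 8-way figure implies,
and uses the GCC/Clang builtin __builtin_prefetch.  The names limb_t,
add_limb and add_n_prefetch, and the untuned PREFETCH_AHEAD distance,
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;      /* hypothetical stand-in for mp_limb_t */

#define LIMBS_PER_LINE 8      /* assumption: 64-byte line / 8-byte limb */
#define PREFETCH_AHEAD 32     /* assumption: distance in limbs, untuned */

/* Add one limb pair plus the carry in *cy; return the sum limb and
   leave the carry out (0 or 1) in *cy. */
static inline limb_t add_limb(limb_t a, limb_t b, limb_t *cy)
{
    limb_t s = a + *cy;
    limb_t c1 = s < *cy;      /* carry from adding the carry-in */
    limb_t r = s + b;
    limb_t c2 = r < b;        /* carry from adding b */
    *cy = c1 | c2;            /* at most one of the two can be set */
    return r;
}

/* {rp, n} = {ap, n} + {bp, n}; returns the carry out. */
limb_t add_n_prefetch(limb_t *rp, const limb_t *ap, const limb_t *bp,
                      size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    /* One block per assumed cache line: two prefetches per 8 limbs. */
    for (; i + LIMBS_PER_LINE <= n; i += LIMBS_PER_LINE) {
        if (i + PREFETCH_AHEAD < n) {   /* stay inside the operands */
            __builtin_prefetch(&ap[i + PREFETCH_AHEAD]);
            __builtin_prefetch(&bp[i + PREFETCH_AHEAD]);
        }
        /* Fixed trip count; left to the compiler to unroll. */
        for (int k = 0; k < LIMBS_PER_LINE; k++)
            rp[i + k] = add_limb(ap[i + k], bp[i + k], &cy);
    }

    /* Tail: fewer than 8 limbs remain. */
    for (; i < n; i++)
        rp[i] = add_limb(ap[i], bp[i], &cy);
    return cy;
}
```

In hand-written asm the 8 limb additions would be spelled out
explicitly, which is where the code size and setup cost come from.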

We don't have prefetch in many asm GMP routines yet.

-- 
Torbjörn