SSE2 rshift for 64bit Core2

Wed Mar 19 16:35:33 CET 2008

I looked briefly at your work.

Using SSE instructions for GMP with good performance is tricky.  In
32-bit mode, we can use 64-bit adds to get the carry, but in 64-bit
mode we have nothing reasonably efficient to find out whether a carry
occurred.
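
To make the carry problem concrete, here is a minimal sketch in
portable C of what a 64-bit limb addition has to compute.  The names
add_limb_with_carry and add_n are hypothetical and are not GMP
functions.  The integer unit gets the carry for free from the flags
(add/adc), whereas a packed 64-bit SSE2 add sets no flags, so the
comparisons below would have to be emulated there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP.
   Adds two 64-bit limbs plus an incoming carry (0 or 1), stores the
   low 64 bits of the result in *sum, and returns the carry out. */
static uint64_t add_limb_with_carry(uint64_t a, uint64_t b,
                                    uint64_t carry_in, uint64_t *sum)
{
    uint64_t s = a + b;         /* wraps modulo 2^64 */
    uint64_t c1 = s < a;        /* 1 if a + b wrapped around */
    uint64_t t = s + carry_in;
    uint64_t c2 = t < s;        /* 1 if adding the carry wrapped around */
    *sum = t;
    return c1 | c2;             /* both cannot be set at the same time */
}

/* Hypothetical n-limb addition, least significant limb first.
   Returns the carry out of the most significant limb. */
static uint64_t add_n(uint64_t *rp, const uint64_t *ap,
                      const uint64_t *bp, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = add_limb_with_carry(ap[i], bp[i], carry, &rp[i]);
    return carry;
}
```

With 32-bit limbs the same loop needs no comparisons: each limb can be
zero-extended into a 64-bit lane, and the carry then simply appears as
bit 32 of the 64-bit sum.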

There are useful general shift instructions for 64-bit quantities
already in MMX.  However, the 128-bit SSE shift instructions do not
allow general shift counts.

Then we have the problem of latency.

For the 64-bit Pentium4, MMX or SSE shifts will be best for mpn_lshift
and mpn_rshift, since right shifts in the integer registers have a
latency of around 10 cycles.

We don't have that problem for Core2.  The dual-word shrd/shld
instructions are actually well implemented here, and should allow us
to approach 1 cycle/limb.
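
For reference, the per-limb operation that shrd performs can be
written in portable C as below.  This is only a sketch under
assumptions: ref_rshift is a hypothetical name, limbs are taken to be
64 bits, and the interface merely follows the documented mpn_rshift
convention (1 <= count <= 63, n >= 1, shifted-out bits returned in the
high end of a limb).  It is not the GMP implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not the GMP code.
   Shifts the n-limb number {sp, n} right by count bits, writes the
   n result limbs to rp (least significant limb first), and returns
   the bits shifted out, placed in the high bits of the return value.
   Requires n >= 1 and 1 <= count <= 63. */
static uint64_t ref_rshift(uint64_t *rp, const uint64_t *sp,
                           size_t n, unsigned count)
{
    unsigned tnc = 64 - count;          /* complementary shift count */
    uint64_t out = sp[0] << tnc;        /* bits that fall off the low end */

    for (size_t i = 0; i + 1 < n; i++) {
        /* Low limb shifted right, filled from the next limb up.
           A single shrd computes this combination. */
        rp[i] = (sp[i] >> count) | (sp[i + 1] << tnc);
    }
    rp[n - 1] = sp[n - 1] >> count;     /* top limb is filled with zeros */
    return out;
}
```

Each result limb needs two source limbs and two shifts in this form; a
dual-word shift does the combining in one instruction, for any count.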

Perhaps SSE-based mpn_?shift could approach 1 cycle/limb too, for
shift counts <= 8 and shift counts that are a multiple of 8.  But
since shrd/shld allow that performance for any shift count, I think
SSE is not the right approach here.

-- 
Torbjörn