SSE2 rshift for 64bit Core2
tg at swox.com
Wed Mar 19 16:35:33 CET 2008
I looked briefly at your work.
Using SSE instructions for GMP with good performance is tricky. In
32-bit mode, we can use 64-bit adds to get carry, but in 64-bit mode
we have nothing reasonably efficient to find out whether a carry occurred.
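
To make the carry problem concrete, here is a minimal scalar C sketch. It only models the arithmetic and is not GMP code; the names add_limb32 and add_limb64 are hypothetical. With 32-bit limbs the add can be done in a 64-bit lane and the carry read off bit 32. With 64-bit limbs there is no wider lane, so the carry has to be recovered by unsigned comparisons, which is the step that is awkward to do in SSE2.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit limbs: add in a 64-bit lane; the carry out is bit 32 of the sum. */
static uint32_t add_limb32(uint32_t a, uint32_t b, uint32_t cin, uint32_t *carry)
{
    assert(cin <= 1);
    uint64_t wide = (uint64_t)a + b + cin;   /* cannot overflow 64 bits */
    *carry = (uint32_t)(wide >> 32);
    return (uint32_t)wide;
}

/* 64-bit limbs: no wider lane is available, so each partial add needs
   an unsigned comparison to detect wraparound. */
static uint64_t add_limb64(uint64_t a, uint64_t b, uint64_t cin, uint64_t *carry)
{
    assert(cin <= 1);
    uint64_t sum  = a + b;
    uint64_t c1   = sum < a;       /* wrapped on a + b */
    uint64_t sum2 = sum + cin;
    uint64_t c2   = sum2 < sum;    /* wrapped on + cin */
    *carry = c1 | c2;              /* at most one of the two can be set */
    return sum2;
}
```

In the integer unit the comparisons are free because the add instruction sets the carry flag; the sketch spells them out only to show what a flagless SIMD add would have to compute by other means.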
There are useful general shift instructions for 64-bit quantities
already in MMX; the 128-bit SSE shift instructions, however, do not
allow general shift counts.
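
As an illustration of that limitation, the following sketch builds a 128-bit right shift by an arbitrary bit count out of per-lane 64-bit shifts plus a whole-register byte shift. It assumes an x86 compiler providing the SSE2 intrinsics in <emmintrin.h>; the helper name shr128_sse2 is hypothetical and nothing here is taken from GMP.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Shift the 128-bit value {v[0] = low limb, v[1] = high limb} right by
   cnt bits, 1 <= cnt <= 63, and store the two result limbs in r. */
static void shr128_sse2(uint64_t r[2], const uint64_t v[2], unsigned cnt)
{
    assert(cnt >= 1 && cnt <= 63);
    __m128i x  = _mm_loadu_si128((const __m128i *)v);
    __m128i c  = _mm_cvtsi32_si128((int)cnt);         /* count in low bits */
    __m128i ic = _mm_cvtsi32_si128((int)(64 - cnt));

    /* Per-lane shift: each 64-bit lane moves independently, so the bits
       that should cross from the high lane to the low lane are lost. */
    __m128i lanes = _mm_srl_epi64(x, c);

    /* Whole-register shift by 8 bytes: high lane moves to the low lane. */
    __m128i high = _mm_srli_si128(x, 8);

    /* Recover the crossing bits and merge them into the low lane. */
    __m128i cross = _mm_sll_epi64(high, ic);
    _mm_storeu_si128((__m128i *)r, _mm_or_si128(lanes, cross));
}
```

The point of the sketch is the instruction count: a general 128-bit shift takes two lane shifts, a byte shift, and an OR, where a shift by a whole number of bytes would need only the single byte shift.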
Then we have the problem of latency.
For the 64-bit Pentium4, MMX or SSE shifts will be best for mpn_lshift
and mpn_rshift, since right shifts in the integer registers have a
latency of around 10 cycles.
We don't have that problem for Core2. The dual-word shrd/shld
instructions are actually well implemented here, and should allow us
to approach 1 cycle/limb.
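
For reference, the per-limb operation that shrd performs can be modelled in portable C as below. This is a minimal sketch, not the GMP implementation: ref_rshift is a hypothetical name, limbs are assumed to be 64 bits wide and stored least significant first, and the return convention (shifted-out bits left-aligned in a limb) is an assumption modelled on mpn_rshift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number at sp right by cnt bits (1 <= cnt <= 63) and
   write the result to rp. Returns the bits shifted out of the low limb,
   placed in the most significant bits of the return value. */
static uint64_t ref_rshift(uint64_t *rp, const uint64_t *sp, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    uint64_t out = sp[0] << (64 - cnt);
    for (size_t i = 0; i + 1 < n; i++) {
        /* The double-word shift: low limb shifted right, with the low
           bits of the next higher limb shifted in from the top. */
        rp[i] = (sp[i] >> cnt) | (sp[i + 1] << (64 - cnt));
    }
    rp[n - 1] = sp[n - 1] >> cnt;   /* top limb is filled with zeros */
    return out;
}
```

Each iteration of the loop corresponds to one shrd on two adjacent limbs, which is why the cost is independent of the shift count.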
Perhaps SSE-based mpn_?shift could approach 1 cycle/limb too, for
shift counts <= 8 and shift counts that are a multiple of 8. But
since shrd/shld allows that performance for any shift count, I think
SSE is not the right approach here.