SSE2 rshift for 64bit Core2
Torbjorn Granlund
tg at swox.com
Fri Mar 21 23:19:14 CET 2008
Peter Cordes <peter at cordes.ca> writes:
> > We don't have that problem for Core2.  The dual-word shrd/shld
> > instructions are actually well implemented here, and should allow us
> > to approach 1 cycle/limb.
>
> You're right.  I wrote another version of rshift, this time based on
> mpn/x86/rshift.asm (a simple pipelined shrd loop).  Initially I was
> getting 1.89 cycles/limb, but I think the bottleneck was all the
> uops[1] that addressing modes like (%rdi,%rdx,8) generate.  Code here:
> http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/rshift.asm?c=annotate&p=20080319184937-0feb0-8597190e032a0a354173afcccc8a9cc669952dfb.gz
> Using 8(%rdi), etc. and an add $16, %rdi at the end of the loop gave
> me 1.79 c/l.
>
> Unrolling the loop to do 4 limbs per iteration sped it up to ~1.571
> c/l.  That's the best I've managed.  Unrolling further makes the loop
> more than 18 instructions and more than 64 bytes, so it doesn't fit
> in Core 2's loop stream detector buffer (which caches the pre-decoded
> instructions, avoiding the instruction fetch stages while the loop is
> running).  At 8 limbs/iteration, it runs slower: ~1.67 c/l.
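
In C, the per-limb operation such a loop performs is the double-word
shift that shrd computes.  A minimal reference model (a sketch only;
the name ref_rshift is made up), assuming full 64-bit limbs,
1 <= cnt <= 63, and mpn_rshift's convention of returning the bits
shifted out of the low limb:

  #include <stdint.h>
  #include <stddef.h>

  static uint64_t
  ref_rshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
  {
    uint64_t retval = up[0] << (64 - cnt);  /* bits shifted out below rp[0] */
    for (size_t i = 0; i < n - 1; i++)
      /* one shrd: low bits of up[i], high part filled from up[i+1] */
      rp[i] = (up[i] >> cnt) | (up[i + 1] << (64 - cnt));
    rp[n - 1] = up[n - 1] >> cnt;
    return retval;
  }
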
I reached 1.25 c/l with 4-way unrolling some time ago. This is the
inner loop:
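
C Software-pipelined 4-way unroll: the limbs in flight rotate through
C registers r8..r11, so each shld reads limbs loaded on earlier steps.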
L(top):	shld	%cl, %r8, %r11
	mov	(up), %r10
	mov	%r11, (rp)
L(10):	shld	%cl, %r9, %r8
	mov	-8(up), %r11
	mov	%r8, -8(rp)
L(01):	shld	%cl, %r10, %r9
	mov	-16(up), %r8
	mov	%r9, -16(rp)
L(00):	shld	%cl, %r11, %r10
	mov	-24(up), %r9
	lea	-32(up), up
	mov	%r10, -24(rp)
	lea	-32(rp), rp
	sub	$4, n
	jnc	L(top)

Possibly, this could approach 1 c/l with greater unrolling, but
perhaps the 64-byte loop buffer limit will actually make 4-way
unrolling optimal (the loop is 60 bytes).

> A very interesting discovery is that the SSE2 version (with 8-byte
> loads and stores, so no alignment requirements) is faster for large
> n.  If we're working from L2 cache, because the args don't fit in L1,
> sse2 goes ~9 c/l and shrd goes ~13 c/l.  With args too big for L2,
> sse2 goes ~11.5 c/l, while shrd goes ~14 c/l.  Core2 must do
> something differently when loading to xmm registers with movq, or
> when storing with movq/movhpd, compared to 64-bit loads/stores from
> the integer registers.  Or maybe the OOO execution engine just stalls
> differently in the two loops.  I haven't done much with oprofile on
> this version, maybe partly because I'm starting to remember more
> easily how Core 2's pipeline generally works.  Anyway, I put a size
> test in the code to use the sse2 version if n >= 16000 limbs.
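
As an illustration of that approach (a sketch only, not the code in
the darcs repository; the name sse2_rshift is made up, and it assumes
n >= 3 and 1 <= cnt <= 63), an intrinsics rendering of an SSE2 rshift
built from 8-byte movq loads and movq/movhpd stores looks roughly
like:

  #include <emmintrin.h>
  #include <stdint.h>
  #include <stddef.h>

  static void
  sse2_rshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
  {
    __m128i sh  = _mm_cvtsi32_si128 ((int) cnt);        /* shift count */
    __m128i shc = _mm_cvtsi32_si128 ((int) (64 - cnt)); /* complement  */
    size_t i = 0;
    for (; i + 2 < n; i += 2)   /* two limbs per iteration */
      {
        /* lo = {up[i], up[i+1]}, hi = {up[i+1], up[i+2]}, all built
           from 8-byte loads, so no 16-byte alignment is required.  */
        __m128i lo = _mm_unpacklo_epi64
          (_mm_loadl_epi64 ((const __m128i *) (up + i)),
           _mm_loadl_epi64 ((const __m128i *) (up + i + 1)));
        __m128i hi = _mm_unpacklo_epi64
          (_mm_loadl_epi64 ((const __m128i *) (up + i + 1)),
           _mm_loadl_epi64 ((const __m128i *) (up + i + 2)));
        __m128i r = _mm_or_si128 (_mm_srl_epi64 (lo, sh),
                                  _mm_sll_epi64 (hi, shc));
        _mm_storel_epi64 ((__m128i *) (rp + i), r);            /* movq   */
        _mm_storeh_pd ((double *) (rp + i + 1),                /* movhpd */
                       _mm_castsi128_pd (r));
      }
    for (; i < n - 1; i++)      /* scalar tail */
      rp[i] = (up[i] >> cnt) | (up[i + 1] << (64 - cnt));
    rp[n - 1] = up[n - 1] >> cnt;
  }

A real version would of course reuse each loaded limb across
iterations instead of loading it twice.
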
I suggest that you use tune/speed for speed measuring (and devel/try
for correctness checking).
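
For example, assuming the tree is built (speed's -C option prints
times in cycles per limb):

  $ cd tune && ./speed -C -s 1-1000 -f 1.4 mpn_rshift
  $ cd devel && ./try mpn_rshift
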
The loop above runs at 1.25 c/l from L1, and a bit over 3 c/l from L2,
and 13 c/l on my system from main memory. The main memory time will
likely vary a lot with CPU frequency and memory subsystem frequency.

> All times are for Conroe, and were the same on Harpertown (Penryn)
> (except the sse2 times, which are slower on Harpertown than on
> Conroe).  K8 is hopeless on this code, too: ~4.5 c/l for n=496.

I believe shifting is the one GMP operation where the Core2 family
can beat the K8 family.

> I counted my loop that way and it seemed sensible that all the
> (%rdi,%rdx,8) addresses + the shifts would keep all three ALU
> execution units busy full time and achieve however many clocks/limb
> I was getting.

I doubt it.  Using plain addressing like (reg) and then incrementing
or decrementing that address register in an (unrolled) loop can be
beneficial for P6 family processors for one reason only: lack of
register read ports.  A recently modified register is read through a
forwarding path, but a base register will need to be read from the
register bank.  There are only two read ports, and register read
contention is a common cause of lower-than-expected performance.

> I wonder if it would be possible to gain any speed with a loop that
> mixed ALU (shrd) with SSE2 shifts.  Maybe split the src in 1/2 or
> 2/3, and do the top part with SSE2 and the bottom part with shrd.
> You could choose the split point so the dest was 16-byte aligned,
> too.

Nice idea, but I think it will not be faster than using just shXd,
since that loop already transfers close to 128 bits/cycle (a 64-bit
load plus a 64-bit store per limb at 1.25 c/l is ~102 bits/cycle, and
would be exactly 128 at 1 c/l), presumably the L1 cache bandwidth.
--
Torbjörn