SSE2 rshift for 64bit Core2

Peter Cordes peter at cordes.ca
Thu Mar 20 10:32:54 CET 2008


On Wed, Mar 19, 2008 at 04:35:33PM +0100, Torbjorn Granlund wrote:
> I looked briefly at your work.
> 
> Using SSE instructions for GMP with good performance is tricky.  In
> 32-bit mode, we can use 64-bit adds to get carry, but in 64-bit mode
> we have nothing reasonably efficient to find out if carry occured.
> 
> There are useful general shift instructions for 64-bit quantities
> already in MMX, however, the 128-bit SSE shift instructions do not
> allow general shift counts.

 True, but SSE2 has proper shift instructions that do twice as much work
per instruction: psrlq and psllq shift the two quadwords in the XMM
register independently, just like the MMX version of the same instructions
in the code I started from.  I construct the shift-with-carry as
limb_i = limb_i >> cnt | limb_(i+1) << (64-cnt).
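
For reference, here's roughly what that recurrence looks like in C with
SSE2 intrinsics.  This is just a minimal sketch of the idea, not the posted
asm; the name rshift_sse2_sketch is made up, and it assumes 0 < cnt < 64.

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdint.h>

/* Sketch only, not the posted asm.  r[i] = s[i]>>cnt | s[i+1]<<(64-cnt),
   two limbs per XMM op where possible.  Returns the bits shifted out of
   the low limb, like mpn_rshift. */
static uint64_t rshift_sse2_sketch(uint64_t *rp, const uint64_t *sp,
                                   long n, unsigned cnt)
{
    uint64_t ret = sp[0] << (64 - cnt);
    __m128i right = _mm_cvtsi32_si128((int)cnt);		/* count for psrlq */
    __m128i left  = _mm_cvtsi32_si128((int)(64 - cnt));	/* count for psllq */
    long i = 0;
    for (; i + 2 < n; i += 2) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(sp + i));	/* limbs i, i+1   */
        __m128i hi = _mm_loadu_si128((const __m128i *)(sp + i + 1));	/* limbs i+1, i+2 */
        _mm_storeu_si128((__m128i *)(rp + i),
                         _mm_or_si128(_mm_srl_epi64(lo, right),
                                      _mm_sll_epi64(hi, left)));
    }
    for (; i < n - 1; i++)				/* scalar tail */
        rp[i] = (sp[i] >> cnt) | (sp[i + 1] << (64 - cnt));
    rp[n - 1] = sp[n - 1] >> cnt;			/* top limb: plain shift */
    return ret;
}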

 The code I've posted works for anything shift.c can throw at it, with any
valid shift count.  It compares the output to a reference implementation,
with guard data at the head and tail.  (Well, I usually just test with
cnt=13, since none of the code does anything conditional on the shift count.)
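
For the curious, the check is shaped roughly like this (a throwaway
sketch, not GMP's shift.c; the reference loop, guard value, and sizes here
are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define GUARD 0xdeadbeefdeadbeefULL

/* asm routine under test; same argument order as mpn_rshift */
extern uint64_t mpn_rshift_core2(uint64_t *rp, const uint64_t *sp,
                                 long n, unsigned cnt);

/* plain C reference implementation */
static uint64_t ref_rshift(uint64_t *rp, const uint64_t *sp, long n, unsigned cnt)
{
    uint64_t ret = sp[0] << (64 - cnt);		/* bits shifted out the bottom */
    for (long i = 0; i < n - 1; i++)
        rp[i] = (sp[i] >> cnt) | (sp[i + 1] << (64 - cnt));
    rp[n - 1] = sp[n - 1] >> cnt;
    return ret;
}

int main(void)
{
    enum { N = 496, G = 4 };			/* size and guard-word count */
    static uint64_t src[N], want[N], got[N + 2 * G];
    unsigned cnt = 13;
    int ok = 1;

    for (long i = 0; i < N; i++)
        src[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
    for (long i = 0; i < N + 2 * G; i++)
        got[i] = GUARD;

    ok &= ref_rshift(want, src, N, cnt) == mpn_rshift_core2(got + G, src, N, cnt);
    for (long i = 0; i < N; i++)
        ok &= want[i] == got[G + i];
    for (long i = 0; i < G; i++)		/* guard words at head and tail untouched? */
        ok &= got[i] == GUARD && got[N + G + i] == GUARD;
    printf(ok ? "ok\n" : "MISMATCH\n");
    return !ok;
}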

 Someone who cared about 32-bit (i.e. not me) might be able to write a
pretty good shift function.  Actually, mine would probably port pretty well
to ia32, since I don't use any extended regs, and the shift operation
doesn't change at all; it could still go in 64-bit chunks.  The main
problem is alignment: an ia32 version would have to deal with args that are
potentially only 32-bit aligned.  A shrd intro/outro could take care of
that, though.

 Here's the spreadsheet I made up to keep track of data movement in the XMM
regs: http://cordes.ca/repos/gmp-shift/asm.ods
Here's a forum thread where I posted my wish for a program that would do
that for me, one that already knew x86/AMD64 asm:
http://ubuntuforums.org/showthread.php?p=4545971

> We don't have that problem for Core2.  The dual-word shrd/shld
> instructions are actually well implemented here, and should allow us
> to approach 1 cycle/limb.

 You're right.  I wrote another version of rshift, this time based on
mpn/x86/rshift.asm (a simple pipelined shrd loop).  Initially I was getting
1.89 cycles/limb, but I think the bottleneck was all the uops[1] that
addressing modes like (%rdi,%rdx,8) generate.  Code here:
http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/rshift.asm?c=annotate&p=20080319184937-0feb0-8597190e032a0a354173afcccc8a9cc669952dfb.gz
Using 8(%rdi), etc., with an add $16, %rdi at the end of the loop got it
down to 1.79 c/l.

 Unrolling the loop to do 4 limbs per iteration sped it up to ~1.571 c/l.
That's the best I've managed.  Unrolling further makes the loop more than
18 instructions and more than 64 bytes, so it doesn't fit in Core 2's loop
stream detector buffer (which caches the pre-decoded instructions, avoiding
the instruction-fetch stages while the loop is running).  At 8
limbs/iteration, it runs slower: ~1.67 c/l.

Here's the code for the version with a computed jump:
http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/rshift.asm?c=annotate&p=20080320084938-0feb0-2d47838f40bb97c157f5ec61a959ad2764bede17.gz
A previous version of the code that used three jcc instructions instead of a
computed jump is here:
http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/rshift.asm?c=annotate&p=20080320044409-0feb0-ec31d4fbf5cc6bdfb9751d389329a19eb776f4db.gz
It's faster in some small-n cases, but slower in a few others.  I think all
those branches are likely to pollute the branch predictor, and probably
aren't a good thing when we're not being called in a tight loop where
everything stays in cache and correctly predicted.


 A very interesting discovery is that the SSE2 version (with 8-byte loads
and stores, so no alignment requirements) is faster for large n.  Working
from L2 cache, because the args don't fit in L1, sse2 goes ~9 c/l while
shrd goes ~13 c/l.  With args too big for L2, sse2 goes ~11.5 c/l while
shrd goes ~14 c/l.  Core 2 must do something differently when loading to
xmm registers with movq, or when storing with movq/movhpd, compared to
64-bit loads/stores through the integer registers.  Or maybe the
out-of-order execution engine just stalls differently in the two loops.  I
haven't done much with oprofile on this version, maybe partly because I'm
starting to remember more easily how Core 2's pipeline works in general.
Anyway, I put a size test in the code to use the sse2 version when n >=
16000 limbs.
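
For reference, the kind of load/store pattern I mean looks like this in
intrinsics (just a sketch; the helper names are made up, and the real loop
keeps more state in registers):

#include <emmintrin.h>
#include <stdint.h>

/* Sketch of the unaligned-friendly access pattern: build an XMM register
   from two 8-byte movq loads, and write it back as movq + movhpd, so no
   16-byte alignment is ever required.  Helper names are made up. */
static inline __m128i load_2limbs(const uint64_t *p)
{
    return _mm_unpacklo_epi64(_mm_loadl_epi64((const __m128i *)p),       /* movq */
                              _mm_loadl_epi64((const __m128i *)(p + 1)));
}

static inline void store_2limbs(uint64_t *p, __m128i x)
{
    _mm_storel_epi64((__m128i *)p, x);                        /* movq   */
    _mm_storeh_pd((double *)(p + 1), _mm_castsi128_pd(x));    /* movhpd */
}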

 All times are for Conroe, and were the same on Harpertown (Penryn), except
the sse2 times, which are slower on Harpertown than on Conroe.  K8 is
hopeless on this code, too: ~4.5 c/l for n=496.

 It might also be possible to go faster for small n with a function that
doesn't do a computed goto.  The imul and all the lea setup for it take a
lot of space and time.  There's probably some room to optimize that in the
current function, but I don't know all the asm tricks.
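
In C terms, the computed jump is basically a Duff's device: enter the
4x-unrolled loop at the spot chosen by (n-1)%4.  Here's a throwaway
illustration of just the control flow (it reloads limbs instead of keeping
them live in registers, and ignores the return limb):

#include <stdint.h>

/* Illustration only: assumes n >= 2 and 0 < cnt < 64. */
static void rshift_duff_sketch(uint64_t *rp, const uint64_t *sp,
                               long n, unsigned cnt)
{
    long todo = n - 1, i = 0;		/* number of shrd-style steps */
    switch (todo & 3) {			/* like the jmp *%r8 in the asm below */
    case 0: do { rp[i] = (sp[i] >> cnt) | (sp[i+1] << (64 - cnt)); i++;
    case 3:      rp[i] = (sp[i] >> cnt) | (sp[i+1] << (64 - cnt)); i++;
    case 2:      rp[i] = (sp[i] >> cnt) | (sp[i+1] << (64 - cnt)); i++;
    case 1:      rp[i] = (sp[i] >> cnt) | (sp[i+1] << (64 - cnt)); i++;
            } while (i < todo);
    }
    rp[n - 1] = sp[n - 1] >> cnt;	/* top limb */
}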


[1] Intel's optimization guide keeps talking about using single-uop
instructions, but nowhere have I seen anything about how to figure out how
many uops a given instruction will decode to.  I'm guessing every
arithmetic operation required by the instruction, plus the address
generation for its operands, is a uop.  Counting my loop that way, it
seemed plausible that all the (%rdi,%rdx,8) addresses plus the shifts
would keep all three ALU execution units busy full time and produce
roughly the clocks/limb I was measuring.

 I wonder if it would be possible to gain any speed with a loop that mixed
ALU (shrd) with SSE2 shifts: maybe split the src 1/2 or 2/3 of the way up,
doing the top part with SSE2 and the bottom part with shrd.  You could
choose the split point so the dest for the SSE2 part was 16-byte aligned,
too.
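
A rough sketch of the idea (assumptions: n = 3*m + 1 and 0 < cnt < 64 to
dodge the odd/even bookkeeping; the return limb and the alignment of the
split point are ignored):

#include <emmintrin.h>
#include <stdint.h>

/* Sketch only.  One loop body does one scalar (shrd-style) limb from the
   bottom third and two SSE2 limbs from the top two thirds, so the integer
   and SIMD shift units are both fed from the same loop.  n = 3*m + 1. */
static void rshift_mixed_sketch(uint64_t *rp, const uint64_t *sp,
                                long m, unsigned cnt)
{
    const long n = 3 * m + 1;
    __m128i right = _mm_cvtsi32_si128((int)cnt);
    __m128i left  = _mm_cvtsi32_si128((int)(64 - cnt));
    for (long j = 0; j < m; j++) {
        /* bottom part: what shrd would do */
        rp[j] = (sp[j] >> cnt) | (sp[j + 1] << (64 - cnt));
        /* top part: limbs m+2j and m+2j+1 in one XMM op */
        __m128i lo = _mm_loadu_si128((const __m128i *)(sp + m + 2 * j));
        __m128i hi = _mm_loadu_si128((const __m128i *)(sp + m + 2 * j + 1));
        _mm_storeu_si128((__m128i *)(rp + m + 2 * j),
                         _mm_or_si128(_mm_srl_epi64(lo, right),
                                      _mm_sll_epi64(hi, left)));
    }
    rp[n - 1] = sp[n - 1] >> cnt;	/* top limb */
}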

[2] Here's the code with some comments stripped out.  Get the full version
from http://cordes.ca/repos/gmp-shift/rshift.asm, or darcs get
http://cordes.ca/repos/gmp-shift/ (or darcs pull if you already did a get).

ASM_START()
C TODO: remove this, or make it 16
C ALIGN(1024)
C ALIGN(16)
PROLOGUE(mpn_rshift_core2)
	cmp	$16000,	%rdx
	jge	mpn_rshift_sse2		C TODO: and not penryn/harpertown unless we can use the aligned version

	C shift count can stay where it is in %rcx
	push	%rbx
	mov	(%rsi), %rbx		C %rbx = limb0
	push	%rbp
	xor	%eax,	%eax
	shrd	%cl,	%rbx,	%rax	C %rax = ret val limb.  %rbx still = limb0

	C do some work that's wasted for (n-1)%4==0.  This hopefully shortens the critical path for the computed jump
	mov	%rdx,	%rbp
	sub	$1,	%rbp
	and	$3,	%rbp
	imul	$-12,	%rbp,	%r8	C do this early to hide the latency
	lea	(L(c2_1)+3*12)(%r8), %r8
C %rax = ret val; %rbx=limb0; %rcx=shift count; %rdx=size

	sub	$1,	%rdx		C %rdx = n-1
	jle	L(c2_end_short)		C if(n<=1), no loop.  %rdi on entry points to top (only) limb
	push	%rax
	mov	%rbx,	%rax		C get limb0 in both regs, so we can jump into the loop anywhere.

	test	%rbp, %rbp
	jz	L(c2_plus1)		C (n-1)%4==0, n==4m+1.  special case: don't need to adjust pointers (and the code below would fail because (n-1)%4 = 0, not 4.)

C adjust pointers for jump into loop:  %rsi, %rdi -= (4 - (n-1)%4)*8 == add -4*8 + ((n-1)%4)*8
	lea	(-4*8)(%rdi,%rbp,8), %rdi
	lea	(8-4*8)(%rsi,%rbp,8), %rsi
C	lea	8(%rsi), %rsi		C already loaded the first limb


C	cmp	$2,	%rbp
C FIXME: definitely use a computed jump; this is slow on small n when it's the last branch that's taken
C	jg	L(c2_1)			C %rbp=3, (n-1)%4=3, n=4m
C	je	L(c2_2)			C %rbp=2, (n-1)%4=2, n=4m-1
C	jl	L(c2_3)			C %rbp=1, (n-1)%4=1, n=4m-2

C	jmp	*L(c2_1)(%r8)
	jmp	*%r8

L(c2_plus1):
	lea	8(%rsi), %rsi
C	add	$4,	%rdx

L(c2_loop):					C start here on n = 4m+1. with rdi,rsi -= 0.  (n-1)%4 = 0
	mov	(%rsi), %rbx		C load next higher limb
	shrd	%cl, %rbx, %rax		C compute result limb
	mov	%rax,	(%rdi)		C store it
L(c2_1): mov	8(%rsi), %rax			C start here on n = 4m. with rdi,rsi -= 1*8.  (n-1)%4=3
	shrd	%cl, %rax, %rbx
	mov	%rbx,	8(%rdi)
L(c2_2): mov	16(%rsi), %rbx			C start here on n = 4m-1.  with rdi,rsi -= 2*8 (n-1)%4=2
	shrd	%cl, %rbx, %rax
	mov	%rax,	16(%rdi)
L(c2_3): mov	24(%rsi), %rax			C start here on n = 4m-2.  with rdi,rsi -= 3*8 (n-1)%4=1
	shrd	%cl, %rax, %rbx
	mov	%rbx,	24(%rdi)

L(c2_4):					C jump in here could be better than coming in the top
	add	$32,	%rsi
	add	$32,	%rdi
	sub	$4,	%rdx
	jg	L(c2_loop)

C L(c2_end):
	shr	%cl,	%rax		C compute most significant limb
        mov	%rax,	(%rdi)		C store it
	pop	%rax			C return val
	pop	%rbp
	pop	%rbx
	ret

C probably get rid of this and put an extra insn or two on the n=1 path, allowing a jump to c2_end
L(c2_end_short):
	shrq	%cl,	%rbx		C compute most significant limb
        movq	%rbx,	(%rdi)		C store it
	pop	%rbp
	pop	%rbx
	ret
EPILOGUE()


-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

