SSE2 rshift for 64bit Core2

Peter Cordes peter at
Tue Mar 18 20:19:36 CET 2008

 Hi all,

 I thought it would be fun to try my hand at optimizing some assembly code,
so I made an SSE2 version of mpn/x86_64/rshift.asm.  It's faster than GMP's
current MMX code on Core2 (at ~2.0 cycles/limb), but not on K8 (at ~3.79
c/l).  It might do better on K10 (Barcelona), which widens the SIMD
execution units to 128 bits.  (K8 has to do two 64-bit ops for anything that
operates on full XMM registers.)  I don't have access to any Barcelona
machines, just K8, Conroe/Woodcrest, and Harpertown.  I also don't have
access to any P4 hardware running in 64-bit mode.  The P4 sucks for GMP
anyway, so I wouldn't worry about it too much.

 I don't have much practice with actually writing asm, so my code is heavily
commented.  I hope I've deleted all the comments that became stale because
of re-ordering the code...

 The other major limitation of my code currently is that it requires the
input and output operands to be 16-byte aligned, since I use movdqa to load
and store.  It might be possible to maintain 2.0 c/l on Core2 while using
64-bit loads and stores, since that's still just one load per cycle.
Switching movdqa to movdqu (the unaligned form of the same move) pushes the
large-input behaviour up to ~4.0 cycles/limb, even if I use movdqu only for
storing the result.  (I would have thought that would be OK, since the loop
isn't waiting for the store.)

 The other option would be to add code to handle the first limb and then go
into the loop for the aligned part.

 Using tests/devel/shift.c (which I had to clean up a bit to get it to
compile), I did some timing.  See comments in the assembly code for all the
results.  I tested with calls to statically linked functions, which saves
~1.5 cycles.  On my 2.4GHz Conroe (E6600):

The MMX original runs in:
size 1:			16.530	cycles
size 2:			9.547	cycles/limb
size 4:			5.775	cycles/limb
size 496:		2.445	cycles/limb  (same on 2.8GHz Harpertown)
size 10000000:		14.823	cycles/limb  (dual-channel DDR800, g965 chipset)

My code runs in:
size 1:			11.968	cycles
size 2:			6.000	cycles/limb
size 3:			5.333	cycles/limb
size 4:			4.0	cycles/limb
size 496:		2.052	cycles/limb
size 10000000:		10.820	cycles/limb  (dual-channel DDR800, g965 chipset)

(Odd-sized arguments are handled as if they were the next size larger,
except for the store at the end, which explains the size 1 and size 3
numbers.)

 For testing, I work on a copy of rshift.asm in tests/devel, instead of
re-linking the whole libgmp every time I want to test.  I had to copy a rule
into the Makefile for building rshift.o from rshift.asm:

rshift.o: rshift.asm
	$(M4) -DOPERATION_$* `test -f '$<' || echo '$(srcdir)/'`$< >tmp-$*.s
	$(CCAS) $(COMPILE_FLAGS) tmp-$*.s -o $@
#	$(RM_TMP) tmp-$*.s

 Note that the backquotes don't copy/paste properly if your editor does
anything funny.  (jed for some reason requires typing them twice...)
To build and run, I do:
make CC="gcc -g -Wall -static" shift && ./shift 2400000000
(or sudo nice --10 ./shift ..., to get better timing).  24... is my CPU's
clock speed.

 For revision control, I use DARCS.  You can browse my repo at
(not that it's probably going to be very useful, though.  There are only two
files, rshift.asm and shift.c from tests/devel.)
To just get the files, run
darcs get
(or go to the URL with a browser and ignore _darcs)

 Further optimization probably requires unrolling the loop a bit, or
pipelining it more.  This would require more setup/cleanup code to deal with
the end of the array, since we have to prevent the loop from reading into
the next memory page and potentially segfaulting.

 Also, adding special-case support for shift counts that are multiples of 8,
16, or 32 could give big speedups for those cases.  SSE/SSE2 provide lots of
instructions for re-packing byte and word data between registers.  There's
also psrldq/pslldq, which shifts a whole 128-bit register by a fixed number
of bytes (not bits).  The shift count has to be encoded in the instruction,
though, not taken from a register, so it doesn't help cut down the number of
code paths for the special cases.  Come to think of it, most of the shuffle
instructions need the shuffle pattern as an immediate too, so you might need
separate code paths for shift by 16 and shift by 48. :(

 How important are shifts by (any specific) multiples of 8?

 I found one strange result: using pshufd to do a reg-reg move with
re-arrangement results in 2.0 c/l on Conroe, where the instruction has a
latency of 2, but 2.5 c/l on Harpertown, where the super-shuffle engine
gives that instruction a latency of 1.  Maybe having the result ready too
soon causes a ROB read-port stall or something.  I don't have root on a
Harpertown machine to run oprofile.

 One tricky decision I had to make was how to go from
 %xmm3 = limb1:limb0; %xmm7 = limb3:limb2   to
 %xmm3 = limb1:limb0; %xmm6 = limb2:limb1
(notation: here the high quad is written first, %xmmN = high:low; the
comments in the asm below list the low quad first, which is backwards.
Sorry.)

 The best I've come up with is shufpd, which makes the loop run at 2.0c/l on
both Core2 arches, but 3.791c/l on K8.  Using movhlps and punpcklqdq works,
and is faster on K8, but hurts performance for n=4 limbs.  Using pshufd
gives 2.0c/l on Conroe, but 2.5c/l on Harpertown.  It's still only 3.0c/l on
K8, so the MMX version is still best on K8 anyway.

 I never found any very clear word on the penalties for mixing
floating-point SSE with integer SSE.  Something I read said that XMM
registers are marked as containing FP or integer data, and there's a penalty
when you use an instruction that treats them other than how they're marked.
It wasn't clear whether only math ops count for this, or whether shuffle ops
like shufpd do too, or what the penalties are.

 I hope someone finds this useful.  I know it's not currently ready to go
into GMP, since it segfaults on unaligned arguments.  And GMP calls rshift
with unaligned arguments itself, and doesn't say anything in the docs
requiring users to keep args aligned.  I just wanted to post here to see if
anyone had any comments or feedback before I spend more time on it.

#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor ,

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC
-------------- next part --------------
dnl  AMD64 mpn_rshift

dnl  Copyright 2004, 2006 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.

dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of the GNU Lesser General Public License as published
dnl  by the Free Software Foundation; either version 3 of the License, or (at
dnl  your option) any later version.

dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
dnl  License for more details.

dnl  You should have received a copy of the GNU Lesser General Public License
dnl  along with the GNU MP Library.  If not, see


C MMX version:
C		    cycles/limb
C AMD K8:		 2.5
C Intel P4:		 4
C Intel Core 2:		 2.4

C all benchmarks on tesla: E6600 (2.4GHz) w/DDR800 dual-channel.  system not idle, azureus running.

C MMX: (Conroe 2.4GHz. calls to dynamically linked lib function)
C size 1		18.000  cycles/limb
C size 2		9.960-10.020 cyc/limb
C size 49:		2.673	cycles/limb
C size 496:		2.465	cycles/limb
C size 1600:		2.402	cycles/limb

C MMX: (Conroe 2.4GHz. calls to statically linked lib function)
C size 1:		16.530  cycles
C size 2:		9.547	cycles/limb
C size 4:		5.775	cycles/limb
C size 496:		2.445	cycles/limb  (same on 2.8GHz Harpertown)
C size 10000000:	14.823	c/l (dual-channel DDR800, g965 chipset)

C MMX: (K8 2.6GHz (ACT cluster). calls to statically linked lib function.)
C size 496:		2.553	cycles/limb

C this SSE2 version: requires 16byte aligned input and output
C +++++++++++++++++++++ shufpd version with slow (%cl) intro ++++++++++++++++++++
C AMD K8 (awarnach master node, 2.0GHz, Solaris)
C size 1:		14-15		cycles
C size 2:		8.025		cycles/limb
C size 49:		4.337		cycles/limb
C size 496:		3.808		cycles/limb
C size 496, ACT 2.6G	3.788		cycles/limb.  Linux on dual 2218 2.6GHz.
C size 1600:		3.652-3.802	cycles/limb
C size 4000:		3.751		cycles/limb
C size 496001:		12.807		cycles/limb

C Intel Core 2(Conroe 2.4GHz.  calls to statically linked .o)
C size 1:		13.024		cycles.
C size 2:		6.5-6.6		cycles/limb
C size 4:		4.275		cycles/limb (took a long time to settle.  often swung up to 8.800. prob. branch pred problems)
C size 49:		2.245		cycles/limb
C size 496,800:		2.064		cycles/limb
C size 1600:		1.981-2.021	cycles/limb
C size 4000:		2.401		cycles/limb
C size 496001:		8.607		cycles/limb

C Intel Core 2(Harpertown 2.8GHz, system idle. calls to statically linked .o)
C size 496:		2.087		cycles/limb (more measurement overhead?)

C ++++++++++++++++++++ fast intro (%ecx), new loop, shufpd (and sometimes movhlps) version +++++++++++++++++++
C AMD K8 (Opteron 2218)
C size 1		15.938		cycles ( movhlps)
C size 4		5.947		cycles/limb (shufpd)
C size 496		3.791		cycles/limb (3.044 movhlps, still slower than MMX)

C Core 2 Conroe (2.4GHz)
C size 1		12.224		cycles  (11.968 with pxor after the loop)
C size 2		6.000		c/l
C size 3		5.333		c/l
C size 4		4.0		cycles/limb  (6.420 with movhlps)
C size 496		2.052		cycles/limb (2.036-2.068 movhlps)
C size 10000000		10.820		c/l (dual-channel DDR800, g965 chipset)

C Core 2 Harpertown (2.8GHz)
C size 1		12.049		cycles
C size 4		4.013		cycles/limb (shufpd)
C size 496		2.062		cycles/limb (shufpd/movhlps)

C movdqu (unaligned allowed, times for the aligned case) (Conroe 2.4GHz)
C size 1:		14.000	    cycles
C size 2:		6.990-7.050 cycles/limb
C size 496, 4000:	4.048	    cycles/limb
C size 496001:		8.787-8.807 cycles/limb

C psllq has a latency of 2 cycles, throughput 1
C 2 c/limb = 4 c/2 limbs = 4 cycles for the whole loop
C need to hide the latency of the shufpd?  (lat=1, throughput=1)
C reversing the order of the shifts doesn't help.

C 4 cycles ?= latency chain of movdqa (load), shufpd, psllq, por(?)
C further gains would take more loop unrolling or software pipelining to hide the latency of the load
C which would require a bigger intro/outro to not read past the end of the array
C currently the function fits nicely into 128bytes

C optimization: could maybe structure things so limb0,limb1 need the shuffle, to hide the latency.
C probably would need another shuffle before storing, though

C rp	rdi
C up	rsi
C n	rdx    C not allowed to be negative.  not doubling as a sign bit.
C cnt	rcx

C could probably make seriously optimized versions for multiple-of-8 shift counts using SSE2 shuffle instructions.
C note: %6 = limb0, limb1  means the _high_ quad of %xmm6 = limb1.  Intel is little-endian, so this is a bit bogus.  It actually matters if you use psrldq or something.

	movdqa	(%rsi), %xmm3		C %3 = limb0, limb1
	movd	%ecx, %xmm1
	sub	$64, %ecx		C 1 <= cnt < 64, so it's OK to operate on the 32-bit version of it
	neg	%ecx			C we want 64-cnt in ecx as a shift count for getting the return value
	movq	%xmm3, %rax		C %rax = limb0
 	movd	%ecx, %xmm0		C %0=64-cnt=left count=lc; %1=cnt;  C this can go anywhere before the loop.
	shlq	%cl, %rax		C return value=limb0<<lc. shift count has to be in %cl.	 C modifies flags, so it has to go before the loop

	leaq	(%rsi,%rdx,8), %rsi     C rsi = rsi + rdx*8 = last limb + 1
	leaq	(%rdi,%rdx,8), %rdi
	negq	%rdx
C rsi=last srclimb+1; rdi=last destlimb+1; rdx=-n; %0=64-cnt; %1=cnt;
C %3=limb0,limb1;

	addq	$2, %rdx		C n= -n + 2
	jge	L(out)			C skip the loop if n<=2

	ALIGN(16)			C minimal alignment for claimed speed
L(loop):
	movdqa	(%rsi,%rdx,8), %xmm7	C %7=limb2,limb3.  If src is 16-byte aligned, this can't cross a page boundary and segfault.  We might harmlessly read past the end of the array of limbs if its length is odd.

C good:  pshufd works, and gave 2.0 cycles/limb on conroe, but 2.5 on Harpertown. (?!).  3.0 on K8
C	pshufd  $14, %xmm3, %xmm6	C %6=limb1,  14 = (2 + 3<<2)  C latency=2 on pre-penryn

C the best alternative for getting limb1 in the low part of a register
C	pxor	%xmm6, %xmm6		C break the partial-reg dependency.  2.0 cycles/limb instead of 2.5 on conroe and penryn
C	movhlps	%xmm3, %xmm6
C require %6=limb1,xxx; %7=limbs(2,3)
C	punpcklqdq %xmm7, %xmm6		C dest=dest[0],src[0]

	movdqa	%xmm3, %xmm6
C require: %3=limbs(0,1); %6=xxx,limb1; %7=limbs(2,3)  C shufpd version ran 2.0 cycles/limb on conroe and penryn, but 3.7 instead of 3.0 on K8
	shufpd	$1, %xmm7, %xmm6		C %6=limbs(1,2).  take dest[1],src[0], so op=1+0<<1

C bad	pslldq	$4, %xmm2	C only operates in-place and has latency=2 on pre-penryn

C require: %3=limb0,limb1; %6 = limb1,limb2
	psllq	%xmm0, %xmm6
	psrlq	%xmm1, %xmm3		C %3=limbs(0,1)>>cnt; %6=limbs(1,2)<<lc
	por	%xmm6, %xmm3		C %3=result limbs(0,1)
	movdqa	%xmm3, -16(%rdi,%rdx,8) C store the result.

C require %7=limbs(2,3) (next iteration's limbs(0,1))
C	movdqa	%xmm7, %xmm6		C setup for next iter.
	movdqa	%xmm7, %xmm3
	addq	$2, %rdx
	jl	L(loop)

L(out):
C %3=limb0,limb1 (since it was loaded before rdx was incremented).
	C seems to make no diff where we put pxor, so move it to function start if that helps alignment
	pxor	%xmm6, %xmm6		C zero %6 for the punpckhqdq below

	movdqa  %xmm3, %xmm2
	punpckhqdq %xmm6, %xmm3		C %6=0,0;  %3=limb1,0
	psllq	%xmm0, %xmm3		C %3=limb1<<lc,0
	psrlq	%xmm1, %xmm2		C %2=(limb0,1)>>c
C require: %3=limb1<<lc, 0; %2=limb0>>c,limb1>>c
	por	%xmm2, %xmm3		C %3=result limbs 0,1
	jg	L(endo)			C n = 1 limb left 	C condition flags still set from loop counter
	movdqa	%xmm3, -16(%rdi)	C store the result.
	ret

L(endo):
	movq	%xmm3, -8(%rdi)		C an odd limb to store
	ret


More information about the gmp-devel mailing list