SSE2 rshift for 64bit Core2
peter at cordes.ca
Tue Mar 18 20:19:36 CET 2008
I thought it would be fun to try my hand at optimizing some assembly code,
so I made an SSE2 version of mpn/x86_64/rshift.asm. It's faster than GMP's
current MMX code on Core2 (at ~2.0 cycles/limb), but not on K8 (at ~3.79
c/l). It might do better on K10 (Barcelona), which widens the SIMD
execution units to 128 bits. (K8 has to do two 64-bit ops when operating
on full XMM registers.) I don't have access to any Barcelona machines, just
K8, Conroe/Woodcrest, and Harpertown. I also don't have access to any P4
hardware running in 64bit mode. The P4 sucks for GMP anyway, so I wouldn't
worry about it too much.
I don't have much practice with actually writing asm, so my code is heavily
commented. I hope I've deleted all the comments that became stale because
of re-ordering the code...
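For anyone who hasn't stared at mpn_rshift before, here's roughly what the
function computes, as a plain-C sketch of the semantics (my own reference
version, not GMP's code; names are mine):

```c
#include <stdint.h>
#include <stddef.h>

/* Shift an n-limb number right by cnt bits (0 < cnt < 64).
   Returns the bits shifted out of the low end, left-justified in a limb,
   which is what mpn_rshift's return value holds. */
uint64_t rshift_ref(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    unsigned lc = 64 - cnt;            /* "left count", as in the asm below */
    uint64_t retval = up[0] << lc;     /* bits falling off the bottom */
    for (size_t i = 0; i < n - 1; i++)
        rp[i] = (up[i] >> cnt) | (up[i + 1] << lc);
    rp[n - 1] = up[n - 1] >> cnt;
    return retval;
}
```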
The other major limitation of my code currently is that it requires the
input and output operands to be 16 byte aligned. I use movdqa to load and
store. It might be possible to maintain 2.0 c/l on Core2 while using 64 bit
loads and stores, since that's still just one load per cycle. Switching
movdqa to movdqu (mov double-quad-word aligned/unaligned) pushes the
large-input case up to ~4.0 cycles/limb, even if I use movdqu only for
storing the result. (I would have thought that would be OK, since the loop
isn't waiting for the store.)
The other option would be to add code to handle the first limb and then go
into the loop for the aligned part.
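Whichever alignment strategy wins, the per-iteration work in the vector loop
is just two 64-bit shifts and an OR. As intrinsics, the dataflow looks
something like this (my sketch of the idea, not the asm itself):

```c
#include <emmintrin.h>

/* v01 = {limb0, limb1}; v12 = {limb1, limb2} (after the shuffle step).
   cnt and lc = 64-cnt sit in the low 64 bits of their registers, which
   is how psrlq/psllq take a non-immediate shift count. */
static __m128i rshift_pair(__m128i v01, __m128i v12, __m128i cnt, __m128i lc)
{
    __m128i lo = _mm_srl_epi64(v01, cnt); /* psrlq: limbs(0,1) >> cnt */
    __m128i hi = _mm_sll_epi64(v12, lc);  /* psllq: limbs(1,2) << (64-cnt) */
    return _mm_or_si128(lo, hi);          /* por: two finished result limbs */
}
```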
Using tests/devel/shift.c (which I had to clean up a bit to get it to
compile), I did some timing. See comments in the assembly code for all the
results. I tested with calls to statically linked functions, which saves
~1.5 cycles. On my 2.4GHz Conroe (E6600):
The MMX original runs in:
size 1:        16.530 cycles
size 2:         9.547 cycles/limb
size 4:         5.775 cycles/limb
size 496:       2.445 cycles/limb (same on 2.8GHz Harpertown)
size 10000000: 14.823 c/l (dual-channel DDR800, g965 chipset)
My code runs in:
size 1:        11.968 cycles
size 2:         6.000 c/l
size 3:         5.333 c/l
size 4:         4.000 cycles/limb
size 496:       2.052 cycles/limb
size 10000000: 10.820 c/l (dual-channel DDR800, g965 chipset)
(Odd-sized arguments are handled as if they were the next size larger,
except for the store at the end, which explains the size 1 and size 3 numbers.)
For testing, I work on a copy of rshift.asm in tests/devel, instead of
re-linking the whole libgmp every time I want to test. I had to copy a rule
into the Makefile for building rshift.o from rshift.asm:
$(M4) -DOPERATION_$* `test -f '$<' || echo '$(srcdir)/'`$< >tmp-$*.s
$(CCAS) $(COMPILE_FLAGS) tmp-$*.s -o $@
# $(RM_TMP) tmp-$*.s
Note that the backquotes don't copy/paste properly if your editor does
anything funny. (jed for some reason requires typing them twice...)
To build and run, I do:
make CC="gcc -g -Wall -static" shift && ./shift 2400000000
(or sudo nice --10 ./shift ..., to get better timing). 2400000000 is my CPU's
clock speed in Hz.
For revision control, I use darcs. You can browse my repo at
http://cordes.ca/repos/gmp-shift
(not that it's probably going to be very useful, though. There are only two
files, rshift.asm and shift.c from tests/devel.)
To just get the files, run
darcs get http://cordes.ca/repos/gmp-shift
(or go to the URL with a browser and ignore _darcs)
Further optimization probably requires unrolling the loop a bit, or
pipelining it more. This would require more setup/cleanup code to deal with
the end of the array, since we have to prevent the loop from reading into
the next memory page and potentially segfaulting.
Also, adding special-case support for shift counts that are multiples of 8,
16, or 32 could give big speedups for those cases. SSE/SSE2
provides lots of instructions for re-packing byte and word data between
registers. Also there's psrldq/pslldq, which shifts a whole 128bit register
by a fixed number of bytes (not bits). The shift count has to be encoded in
the instruction, not from a register, though. So it doesn't help cut down
the number of code paths for special cases. Come to think of it, I think
most of the shuffle instructions need the shuffle as an immediate, so you
might need separate code paths for shift by 16 and shift by 48. :(
How important are shifts by (any specific) multiples of 8?
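The immediate-only limitation is visible in the intrinsic for psrldq too.
A sketch (shift count picked arbitrarily for illustration):

```c
#include <emmintrin.h>

/* psrldq shifts the whole 128-bit register right by N *bytes*, and N
   must be a compile-time constant, so a run-time multiple-of-8 bit
   count would need a switch over separate code paths. E.g. a 2-byte
   (16-bit) shift of the full register: */
static __m128i srl128_16bits(__m128i v)
{
    return _mm_srli_si128(v, 2);
}
```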
I found one strange result: Using pshufd to do a reg-reg move with
re-arrangement results in 2.0c/l on Conroe, where the instruction has a
latency of 2, but 2.5c/l on Harpertown, where its super-shuffle engine gives
that instruction a latency of 1. Maybe having it ready too soon causes a ROB read
port stall or something. I don't have root on a Harpertown to run oprofile.
One tricky decision I had to make was how to go from
%xmm3 = limb1:limb0; %xmm7 = limb3:limb2 to
%xmm3 = limb1:limb0; %xmm6 = limb2:limb1
(notation: %xmm3 = low,high, which is big endian, and thus backwards. sorry)
The best I've come up with is shufpd, which makes the loop run at 2.0c/l on
both Core2 arches, but 3.791c/l on K8. Using movhlps and punpcklqdq works,
and is faster on K8, but hurts performance for n=4 limbs. Using pshufd
gives 2.0c/l on Conroe, but 2.5c/l on Harpertown. It's still only 3.0c/l on
K8, so the MMX version is still best on K8 anyway.
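In intrinsics terms, the shufpd option looks like this (my sketch; the casts
are free, they just reinterpret the integer register as FP and back):

```c
#include <emmintrin.h>

/* Build {limb1, limb2} from x3 = {limb0, limb1} and x7 = {limb2, limb3}.
   shufpd's immediate selects one qword from each operand: bit 0 = 1 takes
   the high half of the first (destination) operand, bit 1 = 0 takes the
   low half of the second (source) operand. */
static __m128i middle_pair(__m128i x3, __m128i x7)
{
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(x3), _mm_castsi128_pd(x7), 1));
}
```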
I never found any very clear word on the penalties for mixing floating
point SSE with integer SSE. Something I saw said that XMM registers are
marked as containing FP or int data, and there's a penalty when you use an
instruction that treats them other than how they're marked. It wasn't clear
on whether it was only math ops, or even shuffle ops like shufpd that
counted for this, or what the penalties were.
I hope someone finds this useful. I know it's not currently ready to go
into GMP, since it segfaults on unaligned arguments. And GMP calls rshift
with unaligned arguments itself, and doesn't say anything in the docs
requiring users to keep args aligned. I just wanted to post here to see if
anyone had any comments or feedback before I spend more time on it.
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter at cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC
-------------- next part --------------
dnl AMD64 mpn_rshift
dnl Copyright 2004, 2006 Free Software Foundation, Inc.
dnl This file is part of the GNU MP Library.
dnl The GNU MP Library is free software; you can redistribute it and/or modify
dnl it under the terms of the GNU Lesser General Public License as published
dnl by the Free Software Foundation; either version 3 of the License, or (at
dnl your option) any later version.
dnl The GNU MP Library is distributed in the hope that it will be useful, but
dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
dnl License for more details.
dnl You should have received a copy of the GNU Lesser General Public License
dnl along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.
C MMX version:
C AMD K8: 2.5
C Intel P4: 4
C Intel Core 2: 2.4
C all benchmarks on tesla: E6600 (2.4GHz) w/DDR800 dual-channel. system not idle, azureus running.
C MMX: (Conroe 2.4GHz. calls to dynamically linked lib function)
C size 1 18.000 cycles/limb
C size 2 9.960-10.020 cyc/limb
C size 49: 2.673 cycles/limb
C size 496: 2.465 cycles/limb
C size 1600: 2.402 cycles/limb
C MMX: (Conroe 2.4GHz. calls to statically linked lib function)
C size 1: 16.530 cycles
C size 2: 9.547 cycles/limb
C size 4: 5.775 cycles/limb
C size 496: 2.445 cycles/limb (same on 2.8GHz Harpertown)
C size 10000000: 14.823 c/l (dual-channel DDR800, g965 chipset)
C MMX: (K8 2.6GHz (ACT cluster). calls to statically linked lib function.)
C size 496: 2.553 cycles/limb
C this SSE2 version: requires 16byte aligned input and output
C +++++++++++++++++++++ shufpd version with slow (%cl) intro ++++++++++++++++++++
C AMD K8 (awarnach master node, 2.0GHz, Solaris)
C size 1: 14-15 cycles
C size 2: 8.025 cycles/limb
C size 49: 4.337 cycles/limb
C size 496: 3.808 cycles/limb
C size 496, ACT 2.6G 3.788 cycles/limb. Linux on dual 2218 2.6GHz.
C size 1600: 3.652-3.802 cycles/limb
C size 4000: 3.751 cycles/limb
C size 496001: 12.807 cycles/limb
C Intel Core 2(Conroe 2.4GHz. calls to statically linked .o)
C size 1: 13.024 cycles.
C size 2: 6.5-6.6 cycles/limb
C size 4: 4.275 cycles/limb (took a long time to settle. often swung up to 8.800. prob. branch pred problems)
C size 49: 2.245 cycles/limb
C size 496,800: 2.064 cycles/limb
C size 1600: 1.981-2.021 cycles/limb
C size 4000: 2.401 cycles/limb
C size 496001: 8.607 cycles/limb
C Intel Core 2(Harpertown 2.8GHz, system idle. calls to statically linked .o)
C size 496: 2.087 cycles/limb (more measurement overhead?)
C ++++++++++++++++++++ fast intro (%ecx), new loop, shufpd (and sometimes movhlps) version +++++++++++++++++++
C AMD K8 (Opteron 2218)
C size 1 15.938 cycles ( movhlps)
C size 4 5.947 cycles/limb (shufpd)
C size 496 3.791 cycles/limb (3.044 movhlps, still slower than MMX)
C Core 2 Conroe (2.4GHz)
C size 1 12.224 cycles (11.968 with pxor after the loop)
C size 2 6.000 c/l
C size 3 5.333 c/l
C size 4 4.0 cycles/limb (6.420 with movhlps)
C size 496 2.052 cycles/limb (2.036-2.068 movhlps)
C size 10000000 10.820 c/l (dual-channel DDR800, g965 chipset)
C Core 2 Harpertown (2.8GHz)
C size 1 12.049 cycles
C size 4 4.013 cycles/limb (shufpd)
C size 496 2.062 cycles/limb (shufpd/movhlps)
C movdqu (unaligned allowed, times for the aligned case) (Conroe 2.4GHz)
C size 1: 14.000 cycles/limb
C size 2: 6.990-7.050 cycles/limb
C size 496, 4000: 4.048 cycles/limb
C size 496001: 8.787-8.807 cycles/limb
C psllq has a latency of 2 cycles, throughput 1
C 2c/limb = 4c/2limb = 4cycles for the whole loop
C need to hide the latency of the shufpd? (lat=1, through=1)
C reversing the order of the shifts doesn't help.
C 4 cycles ?= latency chain of movdqa (load), shufpd, psllq, por(?)
C further gains would take more loop unrolling or software pipelining to hide the latency of the load
C which would require a bigger intro/outro to not read past the end of the array
C currently the function fits nicely into 128bytes
C optimization: could maybe structure things so limb0,limb1 need the shuffle, to hide the latency.
C probably would need another shuffle before storing, though
C INPUT PARAMETERS
C rp rdi
C up rsi
C n rdx C not allowed to be negative. not doubling as a sign bit.
C cnt rcx
C could probably make seriously optimized versions for multiple-of-8 shift counts using SSE2 shuffle instructions.
C note: %6 = limb0, limb1 means the _high_ quad of %xmm6 = limb1. Intel is little-endian, so this is a bit bogus. It actually matters if you use psrldq or something.
movdqa (%rsi), %xmm3 C %3 = limb0, limb1
movd %ecx, %xmm1
sub $64, %ecx C cnt must be <=64, so it's ok to operate on small version of it
neg %ecx C we want 64-cnt in ecx as a shift count for getting the return value
movq %xmm3, %rax C %rax = limb0
movd %ecx, %xmm0 C %0=64-cnt=left count=lc; %1=cnt; C this can go anywhere before the loop.
shlq %cl, %rax C return value=limb0<<lc. shift count has to be in %cl. C modifies flags, so it has to go before the loop
leaq (%rsi,%rdx,8), %rsi C rsi = rsi + rdx*8 = last limb + 1
leaq (%rdi,%rdx,8), %rdi
negq %rdx C rdx = -n
C rsi=last srclimb+1; rdi=last destlimb+1; rdx=-n; %0=64-cnt; %1=cnt;
addq $2, %rdx C n= -n + 2
jge L(out) C skip the loop if n<=2
ALIGN(16) C minimal alignment for claimed speed
L(loop):
movdqa (%rsi,%rdx,8), %xmm7 C %7=limb2,limb3. If src is 16byte aligned, this can't cross a page boundary and segfault. we might harmlessly read past the end of the array of limbs if its length is odd.
C good: pshufd works, and gave 2.0 cycles/limb on conroe, but 2.5 on Harpertown. (?!). 3.0 on K8
C pshufd $14, %xmm3, %xmm6 C %6=limb1, 14 = (2 + 3<<2) C latency=2 on pre-penryn
C the best alternative for getting limb1 in the low part of a register
C pxor %xmm6, %xmm6 C break the partial-reg dependency. 2.0 cycles/limb instead of 2.5 on conroe and penryn
C movhlps %xmm3, %xmm6
C require %6=limb1,xxx; %7=limb
C punpcklqdq %xmm7, %xmm6 C dest=dest,src
movdqa %xmm3, %xmm6
C require: %3=limbs(0,1); %6=xxx,limb1; %7=limbs(2,3) C shufpd version ran 2.0 cycles/limb on conroe and penryn, but 3.7 instead of 3.0 on K8
shufpd $1, %xmm7, %xmm6 C %6=limbs(1,2). take dest,src, so op=1+0<<1
C bad pslldq $4, %xmm2 C only operates in-place and has latency=2 on pre-penryn
C require: %3=limb0,limb1; %6 = limb1,limb2
psllq %xmm0, %xmm6
psrlq %xmm1, %xmm3 C %3=limbs(0,1)>>cnt; %6=limbs(1,2)<<lc
por %xmm6, %xmm3 C %3=result limbs(0,1)
movdqa %xmm3, -16(%rdi,%rdx,8) C store the result.
C require %7=limbs(2,3) (next iteration's limbs(0,1))
C movdqa %xmm7, %xmm6 C setup for next iter.
movdqa %xmm7, %xmm3
addq $2, %rdx
jl L(loop) C keep looping while at least 2 limbs remain
L(out):
C %3=limb0,limb1 (since it was loaded before rdx was incremented).
C seems to make no diff where we put pxor, so move it to function start if that helps alignment
pxor %xmm6, %xmm6 C we need this for later, in L(out).
movdqa %xmm3, %xmm2
punpckhqdq %xmm6, %xmm3 C %6=0,0; %3=limb1,0
psllq %xmm0, %xmm3 C %3=limb1<<lc,0
psrlq %xmm1, %xmm2 C %2=(limb0,1)>>c
C require: %3=limb1<<lc, 0; %2=limb0>>c,limb1>>c
por %xmm2, %xmm3 C %3=result limbs 0,1
jg L(endo) C n = 1 limb left C condition flags still set from loop counter
movdqa %xmm3, -16(%rdi) C store the last two result limbs.
ret
L(endo): C n odd: one limb left
movq %xmm3, -8(%rdi) C an odd limb to store
ret