SSE2 rshift for 64bit Core2
peter at cordes.ca
Wed Mar 19 06:56:05 CET 2008
On Wed, Mar 19, 2008 at 01:55:04AM -0300, peter wrote:
> On Tue, Mar 18, 2008 at 04:19:36PM -0300, Peter Cordes wrote:
> > The other major limitation of my code currently is that it requires the
> > input and output operands to be 16 byte aligned. I use movdqa to load and
> > store. It might be possible to maintain 2.0 c/l on Core2 while using 64 bit
> > loads and stores.
> I rewrote the loop to use movq 64bit loads, instead of 128bit movdqa, but I
> haven't eliminated the movdqa for stores. And any sane caller will have
> their 64bit limbs aligned to 64bits, which is all movq needs to go fast.
> Just changing movdqa to movdqu for stores gives 4.0 c/l instead of 2, IIRC.
> So maybe we can unpack the result and use two movqs.
Replacing movdqa with a movq + movhpd pair works reasonably well. It's slower:
~2.132 c/l instead of ~2.052, at n=496. Still 2.562 c/l on Harpertown, and
3.052 c/l on K8.
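
For readers who don't want to dig through the asm, here is a rough C sketch of the same idea using SSE2 intrinsics. It is not the GMP code: rshift_sketch and store_two_halves are hypothetical names, the loads use movdqu-style intrinsics for brevity rather than the movq loads described above, and the high half is stored with the "unpack and use two movqs" variant from the quoted text (_mm_storeh_pd would be the direct counterpart of a movhpd store).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Store a 128-bit value as two 64-bit halves, so dst only needs
   limb (8-byte) alignment rather than 16-byte alignment. */
static void store_two_halves(uint64_t *dst, __m128i v)
{
    _mm_storel_epi64((__m128i *)dst, v);          /* low limb  (movq) */
    __m128i hi = _mm_unpackhi_epi64(v, v);        /* bring the high limb down */
    _mm_storel_epi64((__m128i *)(dst + 1), hi);   /* high limb (second movq) */
}

/* Hypothetical sketch: rp[i] = limbs of {up, n} shifted right by cnt bits,
   for n >= 1 and 0 < cnt < 64.  Returns the bits shifted out of up[0],
   placed in the high bits of the return value. */
static uint64_t rshift_sketch(uint64_t *rp, const uint64_t *up,
                              size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt > 0 && cnt < 64);

    uint64_t out = up[0] << (64 - cnt);
    __m128i rc = _mm_cvtsi32_si128((int)cnt);         /* right-shift count */
    __m128i lc = _mm_cvtsi32_si128((int)(64 - cnt));  /* left-shift count  */
    size_t i = 0;

    /* Two limbs per iteration; needs up[i+2] to exist. */
    for (; i + 2 < n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(up + i));      /* limbs i,   i+1 */
        __m128i b = _mm_loadu_si128((const __m128i *)(up + i + 1));  /* limbs i+1, i+2 */
        __m128i r = _mm_or_si128(_mm_srl_epi64(a, rc),
                                 _mm_sll_epi64(b, lc));              /* psrlq, psllq, por */
        store_two_halves(rp + i, r);
    }

    /* Scalar tail for the last one or two limbs. */
    for (; i < n; i++) {
        uint64_t next = (i + 1 < n) ? up[i + 1] << (64 - cnt) : 0;
        rp[i] = (up[i] >> cnt) | next;
    }
    return out;
}
```

SSE2 is part of the x86-64 baseline, so this should build there without extra flags; a 32-bit x86 build would need something like -msse2.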
BTW timings are from shift.c:
mpn_rshift_sse2: 532ms (2.132 cycles/limb)
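
The conversion from a wall-clock time to cycles/limb isn't shown in this message, so the following is only a sketch of the usual arithmetic, assuming shift.c does something equivalent. cycles_per_limb and its parameters are hypothetical names. Since the result is linear in the time, a 1ms tick at 532ms for 2.132 c/l corresponds to a step of about 0.004 c/l.

```c
#include <assert.h>

/* cycles/limb = elapsed seconds * clock frequency / total limbs processed.
   All parameters are assumptions; shift.c may compute this differently. */
static double cycles_per_limb(double elapsed_ms, double clock_hz,
                              double limbs_per_call, double calls)
{
    assert(limbs_per_call > 0 && calls > 0);
    return (elapsed_ms / 1000.0) * clock_hz / (limbs_per_call * calls);
}
```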
I try to pick a low time that comes up frequently, which is probably good
when there are a couple common ones resulting from 1ms differences. If I
thought I wasn't getting precise enough results, I'd have shift.c run the
I hope there's a way to tweak it back to 2.0 c/l for the large-but-cached
case. I think for the too-large-for-cache case there's no way to match the
performance of movdqa. It might make sense for rshift to detect really
large argument sizes and use a different version of the function with loads
and/or stores that bypass the L1 (and maybe L2) cache, to avoid useless
I don't have any code that uses rshift or lshift; I'm just hacking for the
fun of it.
C version using unaligned loads but aligned stores:
C K8 (2.6 GHz)
C size 496: 3.052 c/l. (no ALIGN: 3.547, movhpd: 3.048) old: 3.039c/l with store commented out.
C size 10000 5.244 c/l (no ALIGN: 5.257. movhpd: 5.248)
C size 10000000 14.053 c/l. (no ALIGN: 14.529. movhpd: 14.005)
C size 3 5.376 c/l (no ALIGN: 5.312. unaligned movhpd stores: 5.355)
C size 4 4.040 c/l (no ALIGN: 3.980, and one less icache line touched?) (movhpd: 4.180)
C Core2:
C size 496: 2.052 c/l. (no ALIGN: 2.052) (movhpd: 2.132)
C size 10000 2.320 c/l (no ALIGN: 2.320) (movhpd: 2.656)
C size 10000000 11.178 c/l. (no ALIGN: 11.195. movhpd: 11.520) (2.4GHz, g965, dual channel DDR800)
C Harpertown (2.8GHz):
C size 496 2.581 c/l. (no ALIGN: 2.558)
C size 10000 2.847 c/l (no ALIGN: 2.968. movhpd: 3.566)
C size 10000000 14.460 c/l. (no ALIGN: 14.403. movhpd: 14.294)
... in the loop
por %xmm3, %xmm6
movq %xmm6, (%rdi,%rdx,8) C store the low limb of the result.
movhpd %xmm6, 8(%rdi,%rdx,8) C store the high limb of the result.
... after the loop
jg L(unal_endodd) C n = 1 limb left. Condition flags are still set from the loop counter.
C movdqa %xmm6, (%rdi) C old aligned 128bit store of the result.
movq %xmm6, (%rdi) C store the low limb of the result.
movhpd %xmm6, 8(%rdi) C store the high limb of the result.
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter at cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC