SSE2 rshift for 64bit Core2

Peter Cordes peter at
Wed Mar 19 06:56:05 CET 2008

On Wed, Mar 19, 2008 at 01:55:04AM -0300, peter wrote:
> On Tue, Mar 18, 2008 at 04:19:36PM -0300, Peter Cordes wrote:
> >  The other major limitation of my code currently is that it requires the
> > input and output operands to be 16 byte aligned.  I use movdqa to load and
> > store.  It might be possible to maintain 2.0 c/l on Core2 while using 64 bit
> > loads and stores.
>  I rewrote the loop to use movq 64bit loads, instead of 128bit movdqa, but I
> haven't eliminated the movdqa for stores.  And any sane caller will have
> their 64bit limbs aligned to 64bits, which is all movq needs to go fast.
> Just changing movdqa to movdqu for stores gives 4.0 c/l instead of 2, IIRC.
> So maybe we can unpack the result and use two movqs.

 Replacing the movdqa store with movq + movhpd works reasonably well.  It's
slower on Conroe: ~2.132 c/l instead of ~2.052, at n=496.  Still 2.562 c/l on
Harpertown, and 3.052 c/l on K8.
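
 To illustrate the store change outside the asm, here is a minimal C sketch
using SSE2 intrinsics.  The helper name store_two_limbs is hypothetical and not
part of the posted code.  It writes a 16-byte result as two 8-byte halves, so
the destination only needs 8-byte alignment.  The asm uses movhpd for the high
half; here the high half is moved down with _mm_unpackhi_epi64 and stored with
a second 64-bit store, and the compiler is free to pick the actual
instruction.  Without SSE2 the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: store two 64-bit limbs with two 8-byte stores
   instead of one 16-byte aligned store (movdqa). */
static void store_two_limbs(uint64_t *dst, const uint64_t pair[2])
{
    assert(((uintptr_t)dst & 7) == 0);      /* 8-byte alignment is enough */
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)pair);
    /* low 64 bits, like movq %xmm6, (%rdi) */
    _mm_storel_epi64((__m128i *)dst, v);
    /* high 64 bits, the role movhpd %xmm6, 8(%rdi) plays in the asm */
    _mm_storel_epi64((__m128i *)(dst + 1), _mm_unpackhi_epi64(v, v));
#else
    memcpy(dst, pair, 2 * sizeof pair[0]);  /* portable fallback */
#endif
}
```

 Calling it with dst and then dst + 1 exercises both a 16-byte-aligned and a
misaligned destination; a movdqa store would fault on the misaligned one.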

BTW timings are from shift.c:
mpn_rshift_sse2:      532ms (2.132 cycles/limb)

 When reporting, I pick a low time that comes up frequently across runs.
That's probably good enough when a couple of common values differ only by
1ms.  If I thought I wasn't getting precise enough results, I'd have shift.c
run the loop longer.
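
 The arithmetic behind such a line can be sketched in C as follows.  This is
not taken from shift.c: the function names, the parameters (clock rate, limbs
per call, repetition count) and the tie-breaking rule are assumptions for
illustration.  The first function converts a wall-clock time to cycles per
limb; the second is one possible way to automate "a low time that comes up
frequently", taking the most common value and preferring the smaller one on a
tie.

```c
#include <assert.h>
#include <stddef.h>

/* Convert elapsed milliseconds to cycles per limb.
   All parameters are hypothetical; shift.c's real interface is not shown. */
static double cycles_per_limb(double ms, double clock_hz,
                              size_t limbs, size_t reps)
{
    assert(limbs > 0 && reps > 0);
    double cycles = ms / 1000.0 * clock_hz;
    return cycles / ((double)limbs * (double)reps);
}

/* Return the most frequently observed time; on a tie, the smaller one. */
static long most_common_low_time(const long *ms, size_t count)
{
    assert(count > 0);
    long best = ms[0];
    size_t best_hits = 0;
    for (size_t i = 0; i < count; i++) {
        size_t hits = 0;
        for (size_t j = 0; j < count; j++)
            hits += (ms[j] == ms[i]);       /* count repeats of ms[i] */
        if (hits > best_hits || (hits == best_hits && ms[i] < best)) {
            best = ms[i];
            best_hits = hits;
        }
    }
    return best;
}
```

 With a 1ms timer, lengthening the run shrinks the relative error of each
sample, which is the same effect as running the loop longer.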

 I hope there's a way to tweak it back to 2.0 c/l for the large-but-cached
case.  I think for the too-large-for-cache case there's no way to match the
performance of movdqa.  It might make sense for rshift to detect really
large argument sizes and use a different version of the function with loads
and/or stores that bypass the L1 (and maybe L2) cache, to avoid useless
cache pollution.
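
 A rough C sketch of that size-dispatch idea follows.  It is untested for
performance and is only an illustration: the threshold value, the function
names, and the choice of _mm_stream_si128 (movntdq) for the cache-bypassing
stores are all assumptions, not something measured in this thread.  It assumes
1 <= cnt <= 63, n > 0, an 8-byte-aligned dst, and non-overlapping operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cutoff in limbs; the post gives no number. */
#define NT_THRESHOLD_LIMBS ((size_t)1 << 20)

/* Result limb i of a right shift by cnt bits (1 <= cnt <= 63). */
static uint64_t limb_at(const uint64_t *src, size_t n, size_t i, unsigned cnt)
{
    uint64_t hi = (i + 1 < n) ? src[i + 1] << (64 - cnt) : 0;
    return (src[i] >> cnt) | hi;
}

/* Ordinary stores: the result goes through the cache. */
static void rshift_cached(uint64_t *dst, const uint64_t *src,
                          size_t n, unsigned cnt)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = limb_at(src, n, i, cnt);
}

/* Non-temporal 16-byte stores where dst is 16-byte aligned. */
static void rshift_streaming(uint64_t *dst, const uint64_t *src,
                             size_t n, unsigned cnt)
{
    size_t i = 0;
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15) != 0 && n > 0) {
        dst[0] = limb_at(src, n, 0, cnt);   /* peel one limb to align */
        i = 1;
    }
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_set_epi64x((long long)limb_at(src, n, i + 1, cnt),
                                   (long long)limb_at(src, n, i, cnt));
        _mm_stream_si128((__m128i *)&dst[i], v);
    }
    _mm_sfence();                           /* order the streaming stores */
#endif
    for (; i < n; i++)                      /* tail, or everything without SSE2 */
        dst[i] = limb_at(src, n, i, cnt);
}

/* Pick a variant by operand size. */
static void rshift_dispatch(uint64_t *dst, const uint64_t *src,
                            size_t n, unsigned cnt)
{
    assert(n > 0 && cnt >= 1 && cnt <= 63);
    if (n >= NT_THRESHOLD_LIMBS)
        rshift_streaming(dst, src, n, cnt);
    else
        rshift_cached(dst, src, n, cnt);
}
```

 Whether bypassing the cache is a win depends on whether the caller reads the
result again soon, so a real cutoff would have to come from measurement.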

 I don't have any code that uses rshift or lshift; I'm just hacking for the
fun of it.

C version using unaligned loads but aligned stores:
C K8 (2.6 GHz)
C size 496:     3.052   c/l. (no ALIGN: 3.547, movhpd: 3.048) old: 3.039c/l with store commented out.
C size 10000	5.244	c/l (no ALIGN: 5.257. movhpd: 5.248)
C size 10000000	14.053	c/l. (no ALIGN: 14.529. movhpd: 14.005)

C Conroe:
C size 3	5.376	c/l  (no ALIGN: 5.312. unaligned movhpd stores: 5.355)
C size 4	4.040	c/l  (no ALIGN: 3.980, and one less icache line touched?) (movhpd: 4.180)
C size 496:	2.052	c/l. (no ALIGN: 2.052) (movhpd: 2.132)
C size 10000	2.320	c/l  (no ALIGN: 2.320) (movhpd: 2.656)
C size 10000000 11.178  c/l. (no ALIGN: 11.195. movhpd: 11.520)  (2.4GHz, g965, dual channel DDR800)

C Harpertown (2.8GHz):
C size 496	2.581	c/l. (no ALIGN: 2.558)
C size 10000	2.847	c/l  (no ALIGN: 2.968.  movhpd: 3.566)
C size 10000000	14.460	c/l. (no ALIGN: 14.403. movhpd: 14.294)

... in the loop
	por	%xmm3,	%xmm6
	movq	%xmm6,	(%rdi,%rdx,8) C store the low limb of the result.
	movhpd	%xmm6,	8(%rdi,%rdx,8) C store the high limb of the result.

... after the loop
	jg	L(unal_endodd)			C n = 1 limb left 	C condition flags still set from loop counter
C	movdqa	%xmm6, (%rdi)	C store the result.
	movq	%xmm6,	(%rdi)	C store the low limb of the result.
	movhpd	%xmm6,	8(%rdi)	C store the high limb of the result.

#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor ,

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

More information about the gmp-devel mailing list