SSE2 rshift for 64bit Core2

Sat Mar 22 02:34:26 CET 2008

On Fri, Mar 21, 2008 at 11:19:14PM +0100, Torbjorn Granlund wrote:
> Peter Cordes <peter at cordes.ca> writes:
>> [...]
> I reached 1.25 c/l with 4-way unrolling some time ago.

 Is there an svn/git/whatever repository for gmp that has stuff like that in
it?  I wish I'd known there was already an optimized version.

>  This is the
> inner loop:
> 
> L(top): shld    %cl, %r8, %r11
>         mov     (up), %r10
>         mov     %r11, (rp)
> L(10):  shld    %cl, %r9, %r8
>         mov     -8(up), %r11
>         mov     %r8, -8(rp)
> L(01):  shld    %cl, %r10, %r9
>         mov     -16(up), %r8
>         mov     %r9, -16(rp)
> L(00):  shld    %cl, %r11, %r10
>         mov     -24(up), %r9
>         lea     -32(up), up
>         mov     %r10, -24(rp)
>         lea     -32(rp), rp
>         sub     $4, n
>         jnc     L(top)
> 
> Possibly, this could approach 1 c/l with greater unrolling, but
> perhaps the 64 byte loop buffer limit will actually make 4-way
> unrolling optimal (the loop is 60 bytes).

 Your loop doesn't look much different, except you don't reuse the same pair
of registers so much.  That lets you separate the loads and stores from the
shifts to pipeline it, so the OOO core has less re-ordering to do.  I
thought register renaming would take care of everything, but I guess not.
Your better-pipelined loop might unroll better than mine.
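
 For anyone following along, here is a minimal C sketch of what each
shld/load/store group in the quoted loop computes, assuming 64-bit limbs
and a shift count between 1 and 63.  The function name lshift_ref and the
parameter name cnt are placeholders made up for illustration; this is not
GMP's code, only a plain reference model.  It walks from the high limb down,
as the quoted loop does with its negative offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not GMP's implementation): shift the
   n-limb number at up left by cnt bits, store the result at rp, and
   return the bits shifted out of the top limb. */
uint64_t lshift_ref(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);

    unsigned tnc = 64 - cnt;          /* complementary shift count */
    uint64_t high = up[n - 1];
    uint64_t out = high >> tnc;       /* bits that fall off the top */

    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];
        /* Same result as: shld %cl, low, high */
        rp[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    rp[0] = high << cnt;              /* lowest limb gets zeros shifted in */
    return out;
}
```

 The unrolled asm avoids the "high = low" copy by rotating the roles of
%r8..%r11 from one group to the next, which the sketch above does not try
to model.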

> The loop above runs at 1.25 c/l from L1, and a bit over 3 c/l from L2,
> and 13 c/l on my system from main memory.  The main memory time will
> likely vary a lot with CPU frequency and memory subsystem frequency.

 Of course.  My desktop is an E6600 Conroe (4MB L2) on a G965 chipset with
dual-channel DDR2-800 memory.  (5-5-5-18 timings, Intel DG965WH motherboard.)
I was surprised how badly my loop did with L1 misses.  3 c/l sounds more
reasonable.  I'll see how it compares with the SSE2 version.

>>  All times are for Conroe, and were the same on Harpertown(Penryn).  (except
>>  the sse2 times, which are slower on Harpertown than Conroe.)  K8 is hopeless
>>  on this code, too: ~4.5 c/l for n=496.
> 
> I believe shifting is the one GMP operation where the Core2 family
> can beat the K8 family.

 Yeah, that's funny.  I'm just pointing out that all my code is only fast on
Core 2, since that's what I'm tuning for.  Not surprising, since I'm looking
at the Core 2 parts of Intel's manuals.  I'm not aiming for a balance
between K8 and Core 2 speed, so this will need to go in mpn/x86_64/core2 or
something.

 Maybe at some point I'll try to make a version tuned for K8.  I might wait
until I have access to some K10 hardware, too.

>>  I counted my loop that way and it seemed sensible that
>>  all the (%rdi, %rdx, 8) addresses + the shifts would keep all three ALU
>>  execution units busy full time and achieve however many clocks/limb I was
>>  getting.
> 
> I doubt it.  Using plain addressing like (reg) and then incrementing or
> decrementing that address register in an (unrolled) loop can be
> beneficial for P6 family processors for one reason only: Lack of
> register read ports.

 Oh yeah, I did profile my SSE2 version with oprofile to look for ROB read
port stalls, but I haven't done this one yet.  A good doc on that is
http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/cycle_accounting.pdf

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC