SSE2 rshift for 64bit Core2
Torbjorn Granlund
tg at swox.com
Fri Mar 21 23:19:14 CET 2008
Peter Cordes <peter at cordes.ca> writes:
> > We don't have that problem for Core2.  The dual-word shrd/shld
> > instructions are actually well implemented here, and should allow us
> > to approach 1 cycle/limb.
>
> You're right.  I wrote another version of rshift, this time based on
> mpn/x86/rshift.asm (a simple pipelined shrd loop).  Initially I was
> getting 1.89 cycles/limb, but I think the bottleneck was all the
> uops[1] that addressing modes like (%rdi,%rdx,8) generate.  Code here:
> http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/rshift.asm?c=annotate&p=20080319184937-0feb0-8597190e032a0a354173afcccc8a9cc669952dfb.gz
> Using 8(%rdi), etc. and an add $16, %rdi at the end of the loop gave
> me 1.79 c/l.
>
> Unrolling the loop to do 4 limbs per iteration sped it up to ~1.571
> c/l.  That's the best I've managed.  Unrolling further makes the loop
> more than 18 instructions and more than 64 bytes, so it doesn't fit
> in Core 2's loop stream detector buffer (which caches the pre-decoded
> instructions, avoiding the instruction fetch stages while the loop is
> running).  At 8 limbs/iteration, it runs slower: ~1.67 c/l.
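
In C, the per-limb operation such a loop performs is the double-word
shift that shrd computes.  A minimal reference model (a sketch only;
the name ref_rshift is made up), assuming full 64-bit limbs,
1 <= cnt <= 63, and mpn_rshift's convention of returning the bits
shifted out of the low limb:

  #include <stdint.h>
  #include <stddef.h>

  static uint64_t
  ref_rshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
  {
    uint64_t retval = up[0] << (64 - cnt);  /* bits shifted out below rp[0] */
    for (size_t i = 0; i < n - 1; i++)
      /* one shrd: low bits of up[i], high part filled from up[i+1] */
      rp[i] = (up[i] >> cnt) | (up[i + 1] << (64 - cnt));
    rp[n - 1] = up[n - 1] >> cnt;
    return retval;
  }
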
I reached 1.25 c/l with 4-way unrolling some time ago. This is the
inner loop:
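
C Software-pipelined 4-way unroll: the limbs in flight rotate through
C registers r8..r11, so each shld reads limbs loaded on earlier steps.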
L(top):	shld	%cl, %r8, %r11
	mov	(up), %r10
	mov	%r11, (rp)
L(10):	shld	%cl, %r9, %r8
	mov	-8(up), %r11
	mov	%r8, -8(rp)
L(01):	shld	%cl, %r10, %r9
	mov	-16(up), %r8
	mov	%r9, -16(rp)
L(00):	shld	%cl, %r11, %r10
	mov	-24(up), %r9
	lea	-32(up), up
	mov	%r10, -24(rp)
	lea	-32(rp), rp
	sub	$4, n
	jnc	L(top)

Possibly, this could approach 1 c/l with greater unrolling, but
perhaps the 64-byte loop buffer limit will actually make 4-way
unrolling optimal (the loop is 60 bytes).

> A very interesting discovery is that the SSE2 version (with 8-byte
> loads and stores, so no alignment requirements) is faster for large
> n.  If we're working from L2 cache, because the args don't fit in L1,
> sse2 goes ~9 c/l and shrd goes ~13 c/l.  With args too big for L2,
> sse2 goes ~11.5 c/l, while shrd goes ~14 c/l.  Core2 must do
> something differently when loading to xmm registers with movq, or
> when storing with movq/movhpd, compared to 64-bit loads/stores from
> the integer registers.  Or maybe the OOO execution engine just stalls
> differently in the two loops.  I haven't done much with oprofile on
> this version, maybe partly because I'm starting to remember more
> easily how Core 2's pipeline generally works.  Anyway, I put a size
> test in the code to use the sse2 version if n >= 16000 limbs.
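
As an illustration of that approach (a sketch only, not the code in
the darcs repository; the name sse2_rshift is made up, and it assumes
n >= 3 and 1 <= cnt <= 63), an intrinsics rendering of an SSE2 rshift
built from 8-byte movq loads and movq/movhpd stores looks roughly
like:

  #include <emmintrin.h>
  #include <stdint.h>
  #include <stddef.h>

  static void
  sse2_rshift (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
  {
    __m128i sh  = _mm_cvtsi32_si128 ((int) cnt);        /* shift count */
    __m128i shc = _mm_cvtsi32_si128 ((int) (64 - cnt)); /* complement  */
    size_t i = 0;
    for (; i + 2 < n; i += 2)   /* two limbs per iteration */
      {
        /* lo = {up[i], up[i+1]}, hi = {up[i+1], up[i+2]}, all built
           from 8-byte loads, so no 16-byte alignment is required.  */
        __m128i lo = _mm_unpacklo_epi64
          (_mm_loadl_epi64 ((const __m128i *) (up + i)),
           _mm_loadl_epi64 ((const __m128i *) (up + i + 1)));
        __m128i hi = _mm_unpacklo_epi64
          (_mm_loadl_epi64 ((const __m128i *) (up + i + 1)),
           _mm_loadl_epi64 ((const __m128i *) (up + i + 2)));
        __m128i r = _mm_or_si128 (_mm_srl_epi64 (lo, sh),
                                  _mm_sll_epi64 (hi, shc));
        _mm_storel_epi64 ((__m128i *) (rp + i), r);            /* movq   */
        _mm_storeh_pd ((double *) (rp + i + 1),                /* movhpd */
                       _mm_castsi128_pd (r));
      }
    for (; i < n - 1; i++)      /* scalar tail */
      rp[i] = (up[i] >> cnt) | (up[i + 1] << (64 - cnt));
    rp[n - 1] = up[n - 1] >> cnt;
  }

A real version would of course reuse each loaded limb across
iterations instead of loading it twice.
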
I suggest that you use tune/speed for speed measuring (and devel/try
for correctness checking).
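
For example, assuming the tree is built (speed's -C option prints
times in cycles per limb):

  $ cd tune && ./speed -C -s 1-1000 -f 1.4 mpn_rshift
  $ cd devel && ./try mpn_rshift
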
The loop above runs at 1.25 c/l from L1, and a bit over 3 c/l from L2,
and 13 c/l on my system from main memory. The main memory time will
likely vary a lot with CPU frequency and memory subsystem frequency.

> All times are for Conroe, and were the same on Harpertown (Penryn)
> (except the sse2 times, which are slower on Harpertown than on
> Conroe).  K8 is hopeless on this code, too: ~4.5 c/l for n=496.

I believe shifting is the one GMP operation where the Core2 family
can beat the K8 family.

> I counted my loop that way and it seemed sensible that all the
> (%rdi,%rdx,8) addresses + the shifts would keep all three ALU
> execution units busy full time and achieve however many clocks/limb
> I was getting.

I doubt it.  Using plain addressing like (reg) and then incrementing
or decrementing that address register in an (unrolled) loop can be
beneficial for P6 family processors for one reason only: lack of
register read ports.  A recently modified register is read through a
forwarding path, but a base register will need to be read from the
register bank.  There are only two read ports, and register read
contention is a common cause of lower-than-expected performance.

> I wonder if it would be possible to gain any speed with a loop that
> mixed ALU (shrd) with SSE2 shifts.  Maybe split the src in 1/2 or
> 2/3, and do the top part with SSE2 and the bottom part with shrd.
> You could choose the split point so the dest was 16-byte aligned,
> too.

Nice idea, but I think it will not be faster than using just shXd,
since that loop already transfers close to 128 bits/cycle (a 64-bit
load plus a 64-bit store per limb at 1.25 c/l is ~102 bits/cycle, and
would be exactly 128 at 1 c/l), presumably the L1 cache bandwidth.
--
Torbjörn