rshift for 64bit Core2

Sun Mar 23 07:17:00 CET 2008

 New record: 8-limb unrolled at 1.1663 c/l (measured with speed.c).  See below.

On Sat, Mar 22, 2008 at 07:05:03PM -0300, Peter Cordes wrote:
>    Pipelining it deeper might get it down even closer to 1c/l, even without
>   unrolling.

 Using a longer software pipeline (code below, [1]) kills performance from L2.
By the time the register is used, it's cold.  SIZE=10001 gets 3.12c/l
instead of the 2.464c/l of the 4-limb loop I posted before, or of the
8-limb loop below pipelined one less deep.  Main mem is still ~10.8c/l.  You
get tons of ROB read port stalls, almost one/clock on some instructions, but
it's still keeping up with main mem.

 A pipeline depth of 2 (other) loads between loading and using a register
seems to be optimal.  This makes perfect sense, because Core 2's load
latency from L1 is 3 cycles.

 Pipelining the 4-limb version this way hits 1.281 c/l.  That's 36206/9231 =
3.922 uops/clock, which is slightly higher than the 3.90 uops/clock we hit
before.  It's probably as close to maxing out the decode bandwidth as we're
going to get.

 An 8-limb unrolled loop can hit 1.201 c/l, at 25831 uop / 6895 clock counts, for
3.74 uop/clock.  Probably even a 5x or 6x unrolled loop could match that.  I
haven't been able to get it any faster than that.  I profiled it, but I
can't tell what's stopping the 8-unrolled loop from going any faster.  I
think ROB read port stalls start to become the limiting factor, though.
Ubuntu's oprofile seems to be broken for some events, like INST_RETIRED.
Maybe retirement rather than decode bandwidth is the limit for the 8-limb
loop.  Or maybe since we're not streaming from the loop buffer any more,
bubbles in the instruction stream to the decoders are the limiting factor.

 I commented it out and went back to 4-limb unrolled, then I had trouble
getting the 8-limb loop back to its former speed.  It turns out it's
sensitive to exactly where you mix in the lea instructions to increment the
pointers.  (They can go anywhere, you just subtract 64 from the offsets of
anything you move them above.)  Moving things around can make it as bad as
1.4c/l, and it's pretty easy to get 1.3c/l.  Usually there's a lot of ROB
read port stall hits at one spot in the loop when it's not running at full
speed.
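
 The offset rule in the parenthetical above is plain pointer arithmetic: once
a pointer has been advanced by 8 limbs (64 bytes), every later access through
it needs its displacement reduced by the same amount to reach the same limb.
A minimal C sketch of that equivalence follows.  The function names are made
up for illustration, and a C compiler is free to schedule the bumps however
it likes, so this shows only the bookkeeping, not the timing effect.

```c
#include <assert.h>
#include <stdint.h>

/* Copy 8 limbs with both pointer bumps clustered at the end. */
void copy8_bump_last(uint64_t **dst, const uint64_t **src)
{
    const uint64_t *s = *src;
    uint64_t *d = *dst;
    for (int k = 0; k < 8; k++)
        d[k] = s[k];
    *src = s + 8;
    *dst = d + 8;
}

/* Same copy, with the bumps hoisted into the middle of the body. */
void copy8_bump_mixed(uint64_t **dst, const uint64_t **src)
{
    const uint64_t *s = *src;
    uint64_t *d = *dst;
    assert(s != 0 && d != 0);
    d[0] = s[0];
    d[1] = s[1];
    d[2] = s[2];
    s += 8;                 /* plays the role of lea 64(%rsi),%rsi */
    d[3] = s[3 - 8];        /* source index drops by 8 limbs = 64 bytes */
    d[4] = s[4 - 8];
    d += 8;                 /* plays the role of lea 64(%rdi),%rdi */
    d[5 - 8] = s[5 - 8];    /* from here on both sides are adjusted */
    d[6 - 8] = s[6 - 8];
    d[7 - 8] = s[7 - 8];
    *src = s;
    *dst = d;
}
```

 Both functions leave the same data and the same final pointers; only the
position of the bumps, and so the displacements after them, differs.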

 I now have a version[2] that hits 1.1663 c/l (with speed.c).  ROB read
port stalls happen 1/100 clock cycles.  67300uop counts / 17319 clock counts
= 3.8859 uops/clock.  I attached the output of opannotate --assembly (note
that only clocks and uops use the same sample count; the other events use a
much smaller count, so they trigger far more often, just to see them at all.)
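
 As a check on the arithmetic, the uops/clock figures in this mail are just
the UOPS_RETIRED sample total divided by the CPU_CLK_UNHALTED sample total.
That only works because both events were sampled with the same count
(4000000 in the attached profile).  The helper name below is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: ratio of two oprofile sample totals.
 * Only meaningful when both events use the same sampling count. */
double uops_per_clock(long uop_samples, long clk_samples)
{
    assert(clk_samples > 0);
    return (double)uop_samples / (double)clk_samples;
}
```

 Feeding it the three pairs quoted above (36206/9231, 25831/6895 and
67300/17319) gives roughly 3.92, 3.75 and 3.89 uops/clock respectively.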

 Ok, so I finally got around to using tune/speed-ext.c.  It's not in my
repository yet, but should make it there soon, for the benefit of the person
who used darcs to get a copy of my code, according to my web server log. :)

 Whatever I do, this is going to need cleanup code at the end, because any
pipelined loop will read off the end of the src array unless it stops early.  So
maybe I'll have a hard time fitting it in two cache lines with a 4-limb
unroll after all.
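
 To make the cleanup issue concrete, here is a C sketch of a right shift whose
loads run ahead of their use, next to a plain reference loop.  It is an
illustration under assumptions, not the asm above: the names ref_rshift,
pipelined_rshift and DEPTH are made up, DEPTH = 3 is an arbitrary look-ahead,
and a compiler will not necessarily keep this schedule.  The point is only
that the main loop has to stop DEPTH limbs early, and a tail has to drain the
limbs that are already loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: shift {src,n} right by cnt bits (1..63) into dst and return
 * the bits shifted out of the low limb, left-justified. */
uint64_t ref_rshift(uint64_t *dst, const uint64_t *src, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt < 64);
    uint64_t ret = src[0] << (64 - cnt);
    for (size_t i = 0; i + 1 < n; i++)
        dst[i] = (src[i] >> cnt) | (src[i + 1] << (64 - cnt));
    dst[n - 1] = src[n - 1] >> cnt;
    return ret;
}

enum { DEPTH = 3 };   /* illustrative: limbs held in flight ahead of use */

uint64_t pipelined_rshift(uint64_t *dst, const uint64_t *src, size_t n,
                          unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt < 64);
    if (n <= DEPTH)                      /* too short to fill the pipeline */
        return ref_rshift(dst, src, n, cnt);

    uint64_t win[DEPTH];                 /* win[k] holds src[i + k] */
    for (size_t k = 0; k < DEPTH; k++)
        win[k] = src[k];
    uint64_t ret = win[0] << (64 - cnt);

    size_t i = 0;
    /* Main loop: stops early so the look-ahead load stays inside src. */
    for (; i + DEPTH < n; i++) {
        uint64_t next = src[i + DEPTH];  /* load for a later iteration */
        dst[i] = (win[0] >> cnt) | (win[1] << (64 - cnt));
        for (size_t k = 0; k + 1 < DEPTH; k++)
            win[k] = win[k + 1];
        win[DEPTH - 1] = next;
    }

    /* Cleanup: i == n - DEPTH; drain the window without loading anything. */
    for (size_t k = 0; k + 1 < DEPTH; k++)
        dst[i + k] = (win[k] >> cnt) | (win[k + 1] << (64 - cnt));
    dst[n - 1] = win[DEPTH - 1] >> cnt;
    return ret;
}
```

 Without the early stop, the load of src[i + DEPTH] in the last DEPTH
iterations would touch memory past the end of the source operand.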

 I tried making a version of the intro that used (%rsi,%rdx,8) addressing,
with only one loop counter ALU insn, but the setup for it takes longer than
it saves except when it's doing about 6 or more iterations.  I could
probably shave some off that, which might be worth trying if we use it for
setup and cleanup.
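
 The indexed intro is not shown in this mail.  One common way to get by with a
single loop counter, which may or may not match what I tried, is to point both
pointers at the last limb and run a negative index up to zero, so the index
update doubles as the loop test.  A C sketch under that assumption (the name
rshift_indexed is made up, and the compiler picks the actual addressing
modes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {src,n} right by cnt bits (1..63) using one induction variable. */
uint64_t rshift_indexed(uint64_t *dst, const uint64_t *src, size_t n,
                        unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt < 64);
    const uint64_t *send = src + (n - 1);   /* points at the last limb */
    uint64_t *dend = dst + (n - 1);
    uint64_t ret = src[0] << (64 - cnt);    /* bits shifted out the bottom */

    /* i runs from -(n-1) up to 0; the increment is the only counter op. */
    for (ptrdiff_t i = -(ptrdiff_t)(n - 1); i != 0; i++)
        dend[i] = (send[i] >> cnt) | (send[i + 1] << (64 - cnt));

    *dend = *send >> cnt;                   /* most significant limb */
    return ret;
}
```

 The extra pointer setup before the loop is the cost that has to be amortized,
which fits with it only paying off after several iterations.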

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

[1]
8-times unrolled, software pipelined.  Note the lea instructions mixed in
through the loop instead of in a cluster at the end.  Clustering them was
creating ROB read port stalls at the top of the loop.

C require: reg1=limb1; reg2=limb2; reg3=limb3;  reg4=limb4; reg8=limb0
C over-pipelined loop
ALIGN(16)
L(c2_loop):	shrd	%cl, reg1, reg8
	mov	(%rsi),		reg5
C  	add	$32,	%rsi
	mov	reg8,	(%rdi)
C	L(c2_10):
	shrd	%cl, reg2, reg1
	mov	(8-0)(%rsi),	reg6
	mov	reg1,	(8-0)(%rdi)
C  	add	$32,	%rdi
C	L(c2_01):
	shrd	%cl, reg3, reg2
	mov	(16-0)(%rsi),	reg7
  	lea	64(%rsi),%rsi
	mov	reg2,	(16-0)(%rdi)
C	L(c2_00):
	shrd	%cl, reg4, reg3
C  	sub	$4, n
	mov	(24-64)(%rsi),	reg8
 	mov	reg3,	(24-0)(%rdi)

 	shrd	%cl, reg5, reg4
 	mov	(32-64)(%rsi),	reg1
 	mov	reg4,	(32-0)(%rdi)
 C	L(c2_10):
 	lea	64(%rdi),%rdi
 	shrd	%cl, reg6, reg5
 	mov	(40-64)(%rsi),	reg2
 	mov	reg5,	(40-64)(%rdi)
C C	L(c2_01):
 	shrd	%cl, reg7, reg6
 	mov	(48-64)(%rsi),	reg3
 	mov	reg6,	(48-64)(%rdi)
C C	L(c2_00):
 	shrd	%cl, reg8, reg7
 	sub	$8, n
 	mov	(56-64)(%rsi),	reg4
 	mov	reg7,	(56-64)(%rdi)

	jg	L(c2_loop)

[2] Super-fast 8-limb version:
ASM_START()
C TODO: remove this, or make it 16
ALIGN(1024)
C ALIGN(16)
PROLOGUE(mpn_rshift_core2)
C 	C would like to get lots of instructions into the OOO execution engine early so it has plenty to work on...
C 	cmp	$16000,	%rdx
C 	jge	mpn_rshift_sse2		C TODO: and not penryn/harpertown unless we can use the aligned version

C regs for the loop.  use macros to make register reallocation easy.
define(n,%rdx)
define(reg1,%r9)
define(reg2,%rax)
define(reg3,%r8)  		C referenced the fewest times
define(reg4,%r11)
define(reg5,%r12)
define(reg6,%r13)
define(reg7,%r14)
define(reg8,%r15)
C define(reg8,reg4)
	C shift count can stay where it is in %rcx
C 	push	reg2
C 	push	reg4
 	push	%r12
 	push	%r13
 	push	%r14
 	push	%r15
	mov	(%rsi), reg8		C reg8 = limb0
	xor	%eax,	%eax
	shrd	%cl,	reg8,	%rax	C %rax = ret val limb.  reg8 still = limb0
	push	%rax			C save retval

C	mov	%rsi,	%r9
	jmp	L(c2_entry)
L(c2_shortloop):
	shrd	%cl,	reg1,	reg8
	mov	reg8,	(%rdi)
	mov	reg1,	reg8
	add	$8,	%rdi
C	add	$8,	%r9
L(c2_entry):
	dec	n		C sub looks like it makes things align better, but dec has the same timings
C 	sub	$1,	n
	jle	L(c2_end)
	add	$8,	%rsi
	mov	(%rsi),	reg1	C reg8=limb0 reg1=limb1
	test	$7,	%dl
	jnz	L(c2_shortloop)

C loop last iter stores ith limb to dest, and loads i+1st limb from src
C	reg8=limb(i) reg1=limb(i+1).  %rdx=n-i-1, %rdx%8=0  %rsi -> limb(i+1)

C  	mov	(%rsi),	reg1
  	mov	8(%rsi),	reg2
	mov	16(%rsi),	reg3

C 	mov	24(%rsi),	reg4
 	lea	24(%rsi),	%rsi

C debug: %r9 = %rsi
C 	mov	(%r9),	reg8
C 	mov	8(%r9),	reg1
C 	mov	16(%r9),reg2

C require: reg1=limb1; reg2=limb2; reg3=limb3;  reg4=limb4; reg8=limb0

C loop is <= 18 insn and <= 4 16byte aligned blocks, so fits into Core 2's loop stream buffer, so alignment doesn't matter
C If this is misaligned so it doesn't fit in the loop buffer, it runs ~1.57 c/l instead of ~1.33
C further unrolling will push it beyond the size of the loop stream detector.  (already close in bytes).
C 8 limbs/iter runs at 1.202 - 1.315 c/l with ALIGN(16).  (slow intro loop has to do more, though...)
C use simple addressing modes, not (%rdi,%rdx,8).  That generates too many reads of not-in-flight register values
   ALIGN(16)
L(c2_loop):	shrd	%cl, reg1, reg8
	mov	reg8,	(%rdi)
	mov	(%rsi),		reg4
C	L(c2_10):
C   	add	$32,	%rsi
	shrd	%cl, reg2, reg1
	mov	reg1,	8(%rdi)
	mov	(8-0)(%rsi),	reg5	C still above the lea, so no -64 here
   	lea	64(%rsi),%rsi
C	L(c2_01):
	shrd	%cl, reg3, reg2
C   	add	$32,	%rdi
	mov	reg2,	(16-0)(%rdi)
	mov	(16-64)(%rsi),	reg6
C	L(c2_00):
	shrd	%cl, reg4, reg3
C   	sub	$4, n
 	mov	reg3,	(24-0)(%rdi)
	mov	(24-64)(%rsi),	reg7

  	shrd	%cl, reg5, reg4
  	mov	(32-64)(%rsi),	reg8
  	mov	reg4,	(32-0)(%rdi)
  	lea	64(%rdi),%rdi
 C	L(c2_10):
  	shrd	%cl, reg6, reg5
  	mov	(40-64)(%rsi),	reg1
  	mov	reg5,	(40-64)(%rdi)
 C	L(c2_01):
  	shrd	%cl, reg7, reg6
  	mov	(48-64)(%rsi),	reg2
  	mov	reg6,	(48-64)(%rdi)
 C	L(c2_00):
  	shrd	%cl, reg8, reg7
  	sub	$8, n
  	mov	(56-64)(%rsi),	reg3
  	mov	reg7,	(56-64)(%rdi)

	jg	L(c2_loop)
C ALIGN(8)  would align the branch target.  only needed if near the end of a 16byte fetch, causing a bubble.
C L(c2_endshort):
L(c2_end):
	pop	%rax			C return val
	shr	%cl,	reg8		C compute most significant limb
	mov	reg8,	(%rdi)		C store it
C  	pop	reg4
C  	pop	reg2
	pop	%r15
	pop	%r14
	pop	%r13
	pop	%r12
 	ret
EPILOGUE()
-------------- next part --------------
/* 
 * Command line: opannotate --assembly speed-ext 
 * 
 * Interpretation of command line:
 * Output annotated assembly listing with samples
 * 
 * CPU: Core 2, speed 1596 MHz (estimated)
 * Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 4000000
 * Counted RAT_STALLS events (Partial register stall cycles) with a unit mask of 0x01 (ROB read port) count 200000
 * Counted RESOURCE_STALLS events (Cycles during which resource stalls occur) with a unit mask of 0x04 (during which the pipeline has exceeded the load or store limit or is waiting to commit all stores) count 200000
 * Counted UOPS_RETIRED events (number of UOPs retired) with a unit mask of 0x0f (Micro-ops retired) count 4000000
 */
                                                               :
                                                               :/usr/local/src/gmp/gmp-4.2.2/tune/speed-ext:     file format elf64-x86-64
                                                               :
                                                               :Disassembly of section .init:
                                                               :Disassembly of section .plt:
                                                               :Disassembly of section .text:
                                                               :
0000000000404400 <mpn_rshift_core2>: /* mpn_rshift_core2 total:  17319 89.5455  1716 80.9816   759 100.000 67300 97.1869 */
    25  0.1293     0       0     0       0     0       0       :  404400:	push   %r12
                                                               :  404402:	push   %r13
                                                               :  404404:	push   %r14
                                                               :  404406:	push   %r15
    17  0.0879     0       0     0       0     0       0       :  404408:	mov    (%rsi),%r15
                                                               :  40440b:	xor    %eax,%eax
                                                               :  40440d:	shrd   %cl,%r15,%rax
    14  0.0724     0       0     0       0     0       0       :  404411:	push   %rax
    13  0.0672     0       0     0       0    45  0.0650       :  404412:	jmp    404422 <mpn_rshift_core2+0x22>
                                                               :  404414:	shrd   %cl,%r9,%r15
                                                               :  404418:	mov    %r15,(%rdi)
                                                               :  40441b:	mov    %r9,%r15
                                                               :  40441e:	add    $0x8,%rdi
    16  0.0827     0       0     0       0     0       0       :  404422:	dec    %rdx
                                                               :  404425:	jle    4044bc <mpn_rshift_core2+0xbc>
                                                               :  40442b:	add    $0x8,%rsi
                                                               :  40442f:	mov    (%rsi),%r9
    15  0.0776     0       0     0       0     0       0       :  404432:	test   $0x7,%dl
                                                               :  404435:	jne    404414 <mpn_rshift_core2+0x14>
                                                               :  404437:	mov    0x8(%rsi),%rax
                                                               :  40443b:	mov    0x10(%rsi),%r8
    16  0.0827     0       0     0       0     0       0       :  40443f:	lea    0x18(%rsi),%rsi
                                                               :  404443:	nopw   %cs:0x0(%rax,%rax,1)
  1818  9.3997    35  1.6517    45  5.9289  7340 10.5996       :  404450:	shrd   %cl,%r9,%r15
    24  0.1241     3  0.1416    10  1.3175    73  0.1054       :  404454:	mov    %r15,(%rdi)
    13  0.0672    22  1.0382    33  4.3478    25  0.0361       :  404457:	mov    (%rsi),%r11
  1860  9.6169    83  3.9169    30  3.9526  7714 11.1397       :  40445a:	shrd   %cl,%rax,%r9
    15  0.0776     4  0.1888     4  0.5270     6  0.0087       :  40445e:	mov    %r9,0x8(%rdi)
     9  0.0465    64  3.0203    64  8.4321    36  0.0520       :  404462:	mov    -0x38(%rsi),%r12
  1846  9.5445   184  8.6833    21  2.7668  6941 10.0234       :  404466:	lea    0x40(%rsi),%rsi
     1  0.0052     3  0.1416     1  0.1318     3  0.0043       :  40446a:	shrd   %cl,%r8,%rax
    28  0.1448    49  2.3124    30  3.9526    30  0.0433       :  40446e:	mov    %rax,0x10(%rdi)
  1821  9.4152   135  6.3709    76 10.0132  6950 10.0364       :  404472:	mov    -0x30(%rsi),%r13
     1  0.0052     7  0.3303     7  0.9223     3  0.0043       :  404476:	shrd   %cl,%r11,%r8
    14  0.0724     1  0.0472     0       0    11  0.0159       :  40447a:	mov    %r8,0x18(%rdi)
  1906  9.8547   232 10.9486    61  8.0369  7414 10.7064       :  40447e:	mov    -0x28(%rsi),%r14
    26  0.1344     2  0.0944     2  0.2635    97  0.1401       :  404482:	shrd   %cl,%r12,%r11
    26  0.1344     0       0     0       0    79  0.1141       :  404486:	mov    -0x20(%rsi),%r15
  1880  9.7203   257 12.1284    43  5.6653  7889 11.3924       :  40448a:	mov    %r11,0x20(%rdi)
    30  0.1551   277 13.0722    14  1.8445   101  0.1459       :  40448e:	lea    0x40(%rdi),%rdi
    15  0.0776     0       0     0       0     2  0.0029       :  404492:	shrd   %cl,%r13,%r12
  1819  9.4049    93  4.3889    89 11.7260  7201 10.3989       :  404496:	mov    -0x18(%rsi),%r9
    34  0.1758    40  1.8877    16  2.1080   118  0.1704       :  40449a:	mov    %r12,-0x18(%rdi)
    88  0.4550    43  2.0293    16  2.1080   237  0.3422       :  40449e:	shrd   %cl,%r14,%r13
  1768  9.1412     9  0.4247    12  1.5810  6622  9.5627       :  4044a2:	mov    -0x10(%rsi),%rax
    13  0.0672     2  0.0944    54  7.1146    47  0.0679       :  4044a6:	mov    %r13,-0x10(%rdi)
    75  0.3878    81  3.8226    38  5.0066   192  0.2773       :  4044aa:	shrd   %cl,%r15,%r14
  1741  9.0016     0       0    21  2.7668  6852  9.8949       :  4044ae:	sub    $0x8,%rdx
    10  0.0517     0       0     8  1.0540    43  0.0621       :  4044b2:	mov    -0x8(%rsi),%r8
    86  0.4447    13  0.6135    18  2.3715   296  0.4274       :  4044b6:	mov    %r14,-0x8(%rdi)
     5  0.0259    67  3.1619    38  5.0066    17  0.0245       :  4044ba:	jg     404450 <mpn_rshift_core2+0x50>
    11  0.0569     0       0     0       0   106  0.1531       :  4044bc:	pop    %rax
   150  0.7756     6  0.2832     8  1.0540   545  0.7870       :  4044bd:	shr    %cl,%r15
                                                               :  4044c0:	mov    %r15,(%rdi)
    53  0.2740     1  0.0472     0       0   220  0.3177       :  4044c3:	pop    %r15
                                                               :  4044c5:	pop    %r14
                                                               :  4044c7:	pop    %r13
                                                               :  4044c9:	pop    %r12
    17  0.0879     3  0.1416     0       0    45  0.0650       :  4044cb:	retq   
                                                               :  4044cc:	nop    
                                                               :  4044cd:	nop    
                                                               :  4044ce:	nop    
                                                               :  4044cf:	nop    
                                                               :Disassembly of section .fini:
                                                               :