rshift for 64bit Core2
Peter Cordes
peter at cordes.ca
Sun Mar 23 07:17:00 CET 2008
New record: 8-limb unrolled at 1.1663 c/l (measured with speed.c). See below.
On Sat, Mar 22, 2008 at 07:05:03PM -0300, Peter Cordes wrote:
> Pipelining it deeper might get it down even closer to 1c/l, even without
> unrolling.
Using a longer software pipeline (code below, [1]) kills performance from L2.
By the time the register will be used, it's cold. SIZE=10001 gets 3.12c/l
instead of 2.464c/l for the 4-limb loop I posted before, or for the below
8-limb loop pipelined one less deep. Main mem is still ~10.8c/l. You get
tons of ROB read port stalls, almost one/clock on some instructions, but
it's still keeping up with main mem.
A pipeline depth of 2 (other) loads between loading and using a register
seems to be optimal. This makes perfect sense, because Core 2's load
latency from L1 is 3 cycles.
Pipelining the 4-limb version this way hits 1.281 c/l. That's 36206/9231 =
3.922 uops/clock, which is slightly higher than the 3.90 uops/clock we hit
before. It's probably as close to maxing out the decode bandwidth as we're
going to get.
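For readers who want to check the arithmetic, here is a minimal C sketch of the uops/clock figures quoted above. The helper names are made up for illustration, and the 4 uops/clock ceiling passed to fraction_of_width is an assumption about Core 2's pipeline width, not something measured in this post.

```c
#include <assert.h>

/* Ratio of two profiler sample totals taken with the same sampling
   interval, e.g. UOPS_RETIRED samples over CPU_CLK_UNHALTED samples.
   (Hypothetical helper, not part of GMP or oprofile.) */
static double per_clock(double uop_samples, double clock_samples)
{
    assert(clock_samples > 0);
    return uop_samples / clock_samples;
}

/* Share of an assumed per-clock ceiling (e.g. 4 uops/clock) that a
   measured uops/clock rate uses. */
static double fraction_of_width(double uops_per_clock, double width)
{
    assert(width > 0);
    return uops_per_clock / width;
}
```

Calling per_clock(36206, 9231) reproduces the 3.922 figure for the 4-limb loop; under the 4-wide assumption that is about 98% of the ceiling.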
An 8-limb unrolled loop can hit 1.201 c/l, at 25831 uop / 6895 clock counts, for
3.74 uop/clock. Probably even a 5x or 6x unrolled loop could match that. I
haven't been able to get it any faster than that. I profiled it, but I
can't tell what's stopping the 8-unrolled loop from going any faster. I
think ROB read port stalls start to become the limiting factor, though.
Ubuntu's oprofile seems to be broken for some events, like INST_RETIRED.
Maybe the retirement instead of decode bandwidth is the limit for the 8-limb
loop. Or maybe since we're not streaming from the loop buffer any more,
bubbles in the instruction stream to the decoders are the limiting factor.
I commented it out and went back to 4-limb unrolled, then I had trouble
getting the 8-limb loop back to its former speed. It turns out it's
sensitive to exactly where you mix in the lea instructions to increment the
pointers. (They can go anywhere, you just subtract 64 from the offsets of
anything you move them above.) Moving things around can make it as bad as
1.4c/l, and it's pretty easy to get 1.3c/l. Usually there's a lot of ROB
read port stall hits at one spot in the loop when it's not running at full
speed.
I now have a version[2] that hits 1.1663 c/l (with speed.c). ROB read
port stalls happen 1/100 clock cycles. 67300 uop counts / 17319 clock counts
= 3.8859 uops/clock. I attached the output of opannotate --assembly (note
that only clocks and uops use the same sample count; the other events use a
much smaller count, so they trigger often enough to show up at all.)
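Because the events were sampled at different intervals, sample totals for different events are not directly comparable. A rough sketch of the scaling, assuming oprofile takes one sample every "count" events (the helper names are hypothetical; the sample totals and counts come from the attached listing's header and its mpn_rshift_core2 total line):

```c
#include <assert.h>

/* Approximate number of hardware events behind a sample total, given
   that one sample is recorded every `interval` events. */
static double estimated_events(double samples, double interval)
{
    assert(samples >= 0 && interval > 0);
    return samples * interval;
}

/* Events of one kind per unhalted clock cycle, after scaling both
   sample totals by their own sampling intervals. */
static double events_per_cycle(double ev_samples, double ev_interval,
                               double clk_samples, double clk_interval)
{
    double cycles = estimated_events(clk_samples, clk_interval);
    assert(cycles > 0);
    return estimated_events(ev_samples, ev_interval) / cycles;
}
```

With uops and clocks both at count 4000000, events_per_cycle(67300, 4000000, 17319, 4000000) is just the plain ratio, 3.8859. For RAT_STALLS at count 200000, events_per_cycle(1716, 200000, 17319, 4000000) comes out near 0.005, the same order of magnitude as the stall rate quoted above.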
Ok, so I finally got around to using tune/speed-ext.c. It's not in my
repository, but should make it there soon. For the benefit of the person
who used darcs to get a copy of my code, according to my web server log. :)
Whatever I do, this is going to need cleanup code at the end, because any
pipelined loop will read off the end of src array unless it stops early. So
maybe I'll have a hard time fitting it in two cache lines with a 4-limb
unroll after all.
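To make the cleanup problem concrete, here is a C sketch of the same structure: a plain reference shift, and a version whose loads run ahead of their use. It is only an illustration of the loop shape, not the assembly above; the function names are made up, and a C compiler is free to reschedule the loads, so it says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain reference: shift an n-limb number right by cnt bits (1..63).
   Returns the bits shifted out, in the high end of a limb. */
static uint64_t rshift_ref(uint64_t *dst, const uint64_t *src,
                           size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    uint64_t ret = src[0] << (64 - cnt);
    for (size_t i = 0; i + 1 < n; i++)
        dst[i] = (src[i] >> cnt) | (src[i + 1] << (64 - cnt));
    dst[n - 1] = src[n - 1] >> cnt;
    return ret;
}

/* Same result, but each limb is loaded two iterations before it is
   first used, so two other loads sit between a load and its use. */
static uint64_t rshift_pipelined(uint64_t *dst, const uint64_t *src,
                                 size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    if (n < 3)                      /* too short to fill the pipeline */
        return rshift_ref(dst, src, n, cnt);

    uint64_t ret = src[0] << (64 - cnt);
    uint64_t a = src[0], b = src[1], c = src[2];   /* prologue */
    size_t i = 0;
    /* The loop reads src[i + 3], so it has to stop three limbs early
       or it would read past the end of src. */
    for (; i + 3 < n; i++) {
        uint64_t d = src[i + 3];    /* load for a later iteration */
        dst[i] = (a >> cnt) | (b << (64 - cnt));
        a = b; b = c; c = d;
    }
    /* Cleanup: drain the limbs still held in registers, no loads. */
    dst[i]     = (a >> cnt) | (b << (64 - cnt));
    dst[i + 1] = (b >> cnt) | (c << (64 - cnt));
    dst[i + 2] = c >> cnt;
    return ret;
}
```

The cleanup block is the part that costs code size: it is a copy of the loop body with the loads removed, once per limb still in flight.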
I tried making a version of the intro that used (%rsi,%rdx,8) addressing,
with only one loop counter ALU insn, but the setup for it takes longer than
it saves except when it's doing about 6 or more iterations. I could
probably shave some off that, which might be worth trying if we use it for
setup and cleanup.
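One common way to get down to a single loop-counter instruction with indexed addressing is to point both bases at the end of the arrays and count a negative index up to zero. The C sketch below shows that idiom; that it matches what the (%rsi,%rdx,8) intro variant actually did is an assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One limb per iteration with a single induction variable: both
   pointers are moved to the last limb, and a negative index counts up
   to zero, so one increment both steps the index and decides the
   branch. */
static uint64_t rshift_one_counter(uint64_t *dst, const uint64_t *src,
                                   size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    uint64_t ret = src[0] << (64 - cnt);
    const uint64_t *s_end = src + (n - 1);   /* -> last source limb */
    uint64_t *d_end = dst + (n - 1);         /* -> last dest limb */
    for (ptrdiff_t i = -(ptrdiff_t)(n - 1); i < 0; i++)
        d_end[i] = (s_end[i] >> cnt) | (s_end[i + 1] << (64 - cnt));
    d_end[0] = s_end[0] >> cnt;              /* most significant limb */
    return ret;
}
```

The setup cost mentioned above corresponds to the two pointer adjustments and the negation before the loop, which only pay off once the loop runs for several iterations.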
--
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter at cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC
[1]
8-times unrolled, software pipelined. Note lea instructions mixed in
through the loop instead of in a cluster at the end. That was creating
ROB read port stalls at the top of the loop.
C require: reg1=limb1; reg2=limb2; reg3=limb3; reg4=limb4; reg8=limb0
C over-pipelined loop
ALIGN(16)
L(c2_loop): shrd %cl, reg1, reg8
mov (%rsi), reg5
C add $32, %rsi
mov reg8, (%rdi)
C L(c2_10):
shrd %cl, reg2, reg1
mov (8-0)(%rsi), reg6
mov reg1, (8-0)(%rdi)
C add $32, %rdi
C L(c2_01):
shrd %cl, reg3, reg2
mov (16-0)(%rsi), reg7
lea 64(%rsi),%rsi
mov reg2, (16-0)(%rdi)
C L(c2_00):
shrd %cl, reg4, reg3
C sub $4, n
mov (24-64)(%rsi), reg8
mov reg3, (24-0)(%rdi)
shrd %cl, reg5, reg4
mov (32-64)(%rsi), reg1
mov reg4, (32-0)(%rdi)
C L(c2_10):
lea 64(%rdi),%rdi
shrd %cl, reg6, reg5
mov (40-64)(%rsi), reg2
mov reg5, (40-64)(%rdi)
C C L(c2_01):
shrd %cl, reg7, reg6
mov (48-64)(%rsi), reg3
mov reg6, (48-64)(%rdi)
C C L(c2_00):
shrd %cl, reg8, reg7
sub $8, n
mov (56-64)(%rsi), reg4
mov reg7, (56-64)(%rdi)
jg L(c2_loop)
[2] Super-fast 8-limb version:
ASM_START()
C TODO: remove this, or make it 16
ALIGN(1024)
C ALIGN(16)
PROLOGUE(mpn_rshift_core2)
C C would like to get lots of instructions into the OOO execution engine early so it has plenty to work on...
C cmp $16000, %rdx
C jge mpn_rshift_sse2 C TODO: and not penryn/harpertown unless we can use the aligned version
C regs for the loop. use macros to make register reallocation easy.
define(n,%rdx)
define(reg1,%r9)
define(reg2,%rax)
define(reg3,%r8) C referenced the fewest times
define(reg4,%r11)
define(reg5,%r12)
define(reg6,%r13)
define(reg7,%r14)
define(reg8,%r15)
C define(reg8,reg4)
C shift count can stay where it is in %rcx
C push reg2
C push reg4
push %r12
push %r13
push %r14
push %r15
mov (%rsi), reg8 C reg8 = limb0
xor %eax, %eax
shrd %cl, reg8, %rax C %rax = ret val limb. reg8 still = limb0
push %rax C save retval
C mov %rsi, %r9
jmp L(c2_entry)
L(c2_shortloop):
shrd %cl, reg1, reg8
mov reg8, (%rdi)
mov reg1, reg8
add $8, %rdi
C add $8, %r9
L(c2_entry):
dec n C sub looks like it makes things align better, but dec has the same timings
C sub $1, n
jle L(c2_end)
add $8, %rsi
mov (%rsi), reg1 C reg8=limb0 reg1=limb1
test $7, %dl
jnz L(c2_shortloop)
C loop last iter stores ith limb to dest, and loads i+1st limb from src
C reg8=limb(i) reg1=limb(i+1). %rdx=n-i-1, %rdx%8=0 %rsi -> limb(i+1)
C mov (%rsi), reg1
mov 8(%rsi), reg2
mov 16(%rsi), reg3
C mov 24(%rsi), reg4
lea 24(%rsi), %rsi
C debug: %r9 = %rsi
C mov (%r9), reg8
C mov 8(%r9), reg1
C mov 16(%r9),reg2
C require: reg1=limb1; reg2=limb2; reg3=limb3; reg4=limb4; reg8=limb0
C loop is <= 18 insn and <= 4 16byte aligned blocks, so fits into Core 2's loop stream buffer, so alignment doesn't matter
C If this is misaligned so it doesn't fit in the loop buffer, it runs ~1.57 c/l instead of ~1.33
C further unrolling will push it beyond the size of the loop stream detector. (already close in bytes).
C 8 limbs/iter runs at 1.202 - 1.315 c/l with ALIGN(16). (slow intro loop has to do more, though...)
C use simple addressing modes, not (%rdi,%rdx,8). That generates too many reads of not-in-flight register values
ALIGN(16)
L(c2_loop): shrd %cl, reg1, reg8
mov reg8, (%rdi)
mov (%rsi), reg4
C L(c2_10):
C add $32, %rsi
shrd %cl, reg2, reg1
mov reg1, 8(%rdi)
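mov (8-0)(%rsi), reg5 C load precedes the lea below, so no -64 here (the profiled build used (8-64), reading the wrong limb)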
lea 64(%rsi),%rsi
C L(c2_01):
shrd %cl, reg3, reg2
C add $32, %rdi
mov reg2, (16-0)(%rdi)
mov (16-64)(%rsi), reg6
C L(c2_00):
shrd %cl, reg4, reg3
C sub $4, n
mov reg3, (24-0)(%rdi)
mov (24-64)(%rsi), reg7
shrd %cl, reg5, reg4
mov (32-64)(%rsi), reg8
mov reg4, (32-0)(%rdi)
lea 64(%rdi),%rdi
C L(c2_10):
shrd %cl, reg6, reg5
mov (40-64)(%rsi), reg1
mov reg5, (40-64)(%rdi)
C L(c2_01):
shrd %cl, reg7, reg6
mov (48-64)(%rsi), reg2
mov reg6, (48-64)(%rdi)
C L(c2_00):
shrd %cl, reg8, reg7
sub $8, n
mov (56-64)(%rsi), reg3
mov reg7, (56-64)(%rdi)
jg L(c2_loop)
C ALIGN(8) would align the branch target. only needed if near the end of a 16byte fetch, causing a bubble.
C L(c2_endshort):
L(c2_end):
pop %rax C return val
shr %cl, reg8 C compute most significant limb
mov reg8, (%rdi) C store it
C pop reg4
C pop reg2
pop %r15
pop %r14
pop %r13
pop %r12
ret
EPILOGUE()
-------------- next part --------------
/*
* Command line: opannotate --assembly speed-ext
*
* Interpretation of command line:
* Output annotated assembly listing with samples
*
* CPU: Core 2, speed 1596 MHz (estimated)
* Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 4000000
* Counted RAT_STALLS events (Partial register stall cycles) with a unit mask of 0x01 (ROB read port) count 200000
* Counted RESOURCE_STALLS events (Cycles during which resource stalls occur) with a unit mask of 0x04 (during which the pipeline has exceeded the load or store limit or is waiting to commit all stores) count 200000
* Counted UOPS_RETIRED events (number of UOPs retired) with a unit mask of 0x0f (Micro-ops retired) count 4000000
*/
:
:/usr/local/src/gmp/gmp-4.2.2/tune/speed-ext: file format elf64-x86-64
:
:Disassembly of section .init:
:Disassembly of section .plt:
:Disassembly of section .text:
:
0000000000404400 <mpn_rshift_core2>: /* mpn_rshift_core2 total: 17319 89.5455 1716 80.9816 759 100.000 67300 97.1869 */
25 0.1293 0 0 0 0 0 0 : 404400: push %r12
: 404402: push %r13
: 404404: push %r14
: 404406: push %r15
17 0.0879 0 0 0 0 0 0 : 404408: mov (%rsi),%r15
: 40440b: xor %eax,%eax
: 40440d: shrd %cl,%r15,%rax
14 0.0724 0 0 0 0 0 0 : 404411: push %rax
13 0.0672 0 0 0 0 45 0.0650 : 404412: jmp 404422 <mpn_rshift_core2+0x22>
: 404414: shrd %cl,%r9,%r15
: 404418: mov %r15,(%rdi)
: 40441b: mov %r9,%r15
: 40441e: add $0x8,%rdi
16 0.0827 0 0 0 0 0 0 : 404422: dec %rdx
: 404425: jle 4044bc <mpn_rshift_core2+0xbc>
: 40442b: add $0x8,%rsi
: 40442f: mov (%rsi),%r9
15 0.0776 0 0 0 0 0 0 : 404432: test $0x7,%dl
: 404435: jne 404414 <mpn_rshift_core2+0x14>
: 404437: mov 0x8(%rsi),%rax
: 40443b: mov 0x10(%rsi),%r8
16 0.0827 0 0 0 0 0 0 : 40443f: lea 0x18(%rsi),%rsi
: 404443: nopw %cs:0x0(%rax,%rax,1)
1818 9.3997 35 1.6517 45 5.9289 7340 10.5996 : 404450: shrd %cl,%r9,%r15
24 0.1241 3 0.1416 10 1.3175 73 0.1054 : 404454: mov %r15,(%rdi)
13 0.0672 22 1.0382 33 4.3478 25 0.0361 : 404457: mov (%rsi),%r11
1860 9.6169 83 3.9169 30 3.9526 7714 11.1397 : 40445a: shrd %cl,%rax,%r9
15 0.0776 4 0.1888 4 0.5270 6 0.0087 : 40445e: mov %r9,0x8(%rdi)
9 0.0465 64 3.0203 64 8.4321 36 0.0520 : 404462: mov -0x38(%rsi),%r12
1846 9.5445 184 8.6833 21 2.7668 6941 10.0234 : 404466: lea 0x40(%rsi),%rsi
1 0.0052 3 0.1416 1 0.1318 3 0.0043 : 40446a: shrd %cl,%r8,%rax
28 0.1448 49 2.3124 30 3.9526 30 0.0433 : 40446e: mov %rax,0x10(%rdi)
1821 9.4152 135 6.3709 76 10.0132 6950 10.0364 : 404472: mov -0x30(%rsi),%r13
1 0.0052 7 0.3303 7 0.9223 3 0.0043 : 404476: shrd %cl,%r11,%r8
14 0.0724 1 0.0472 0 0 11 0.0159 : 40447a: mov %r8,0x18(%rdi)
1906 9.8547 232 10.9486 61 8.0369 7414 10.7064 : 40447e: mov -0x28(%rsi),%r14
26 0.1344 2 0.0944 2 0.2635 97 0.1401 : 404482: shrd %cl,%r12,%r11
26 0.1344 0 0 0 0 79 0.1141 : 404486: mov -0x20(%rsi),%r15
1880 9.7203 257 12.1284 43 5.6653 7889 11.3924 : 40448a: mov %r11,0x20(%rdi)
30 0.1551 277 13.0722 14 1.8445 101 0.1459 : 40448e: lea 0x40(%rdi),%rdi
15 0.0776 0 0 0 0 2 0.0029 : 404492: shrd %cl,%r13,%r12
1819 9.4049 93 4.3889 89 11.7260 7201 10.3989 : 404496: mov -0x18(%rsi),%r9
34 0.1758 40 1.8877 16 2.1080 118 0.1704 : 40449a: mov %r12,-0x18(%rdi)
88 0.4550 43 2.0293 16 2.1080 237 0.3422 : 40449e: shrd %cl,%r14,%r13
1768 9.1412 9 0.4247 12 1.5810 6622 9.5627 : 4044a2: mov -0x10(%rsi),%rax
13 0.0672 2 0.0944 54 7.1146 47 0.0679 : 4044a6: mov %r13,-0x10(%rdi)
75 0.3878 81 3.8226 38 5.0066 192 0.2773 : 4044aa: shrd %cl,%r15,%r14
1741 9.0016 0 0 21 2.7668 6852 9.8949 : 4044ae: sub $0x8,%rdx
10 0.0517 0 0 8 1.0540 43 0.0621 : 4044b2: mov -0x8(%rsi),%r8
86 0.4447 13 0.6135 18 2.3715 296 0.4274 : 4044b6: mov %r14,-0x8(%rdi)
5 0.0259 67 3.1619 38 5.0066 17 0.0245 : 4044ba: jg 404450 <mpn_rshift_core2+0x50>
11 0.0569 0 0 0 0 106 0.1531 : 4044bc: pop %rax
150 0.7756 6 0.2832 8 1.0540 545 0.7870 : 4044bd: shr %cl,%r15
: 4044c0: mov %r15,(%rdi)
53 0.2740 1 0.0472 0 0 220 0.3177 : 4044c3: pop %r15
: 4044c5: pop %r14
: 4044c7: pop %r13
: 4044c9: pop %r12
17 0.0879 3 0.1416 0 0 45 0.0650 : 4044cb: retq
: 4044cc: nop
: 4044cd: nop
: 4044ce: nop
: 4044cf: nop
:Disassembly of section .fini:
: