# rshift for 64bit Core2

Peter Cordes peter at cordes.ca
Sat Mar 22 23:05:03 CET 2008

On Sat, Mar 22, 2008 at 02:46:20PM +0100, Torbjorn Granlund wrote:
> Peter Cordes <peter at cordes.ca> writes:
>
>    Thanks for the starting point.  It only goes 1.33c/l for me, though.
>   (counting overhead, so maybe it's really 1.25).  I haven't profiled it yet,
>   either.
>
> Yes, 1.25 seems a bit optimistic, remeasuring it now yields 1.30 c/l
> just before it starts missing L1 cache.

I sometimes get 1.281 c/l (at SIZE=1500).  But I'm still using shift.c, and
the measurement interval isn't very long when it's that fast.

Yeah, with SIZE=1001, it seems more and more like I get 1.281-1.297.  (1001
is optimal for the intro: the setup loop does 0 shrd insns.)

>    Instead of trying to set up all the registers for a computed jump into the
>   middle of the loop, I just used a rolled-up loop that goes until n&3 == 0,
>   at which point we're ready to enter the unrolled loop.  It's not the fastest
>   possible for small n, but it's pretty good and doesn't fill up the Icache.
>   It could also be used as a cleanup loop at the end, since it already tests
>   for n==0 as an exit condition.
>
> You'll find many variants for handling this in existing GMP assembly
> code.  For 4-way unrolling, I prefer a little decision tree, comparing
> (n mod 4) to 2 first, which sets condition code for deciding if n mod
> 4 is 3, 2, or smaller than 1.

I already came up with that myself. :)  That's the triple-jcc column on the
benchmark numbers (with my not-so-well-pipelined loop).  It did worse than
the intro-loop in nearly every case.  I think it would do even worse on the
pipelined loop that needs more setup (limbs in registers).  IIRC, Core 2
doesn't really like branch instructions with nothing in between.
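The decision tree Torbjorn describes could be sketched like this (plain C, purely hypothetical control flow, not the asm; `unroll_entry` is a made-up name): one comparison of (n mod 4) against 2 feeds both conditional branches, picking among the four entry points.

```c
/* Hypothetical sketch of the decision-tree dispatch: compare
 * (n mod 4) against 2 once, then branch twice off that result,
 * the way a single cmp can feed two conditional jumps. */
static int unroll_entry(unsigned long n)
{
    unsigned long r = n & 3;        /* n mod 4 */
    if (r >= 2)                     /* cmp $2, r */
        return (r == 3) ? 3 : 2;    /* one jcc splits 3 from 2 */
    else
        return (r == 1) ? 1 : 0;    /* the other splits 1 from 0 */
}
```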

The intro loop doesn't just figure out where to jump into the main loop;
it starts touching the source data pretty early, which gets the hardware
prefetcher going.  I think for the shift functions it's going to be hard
to beat, esp. considering the advantage of small size -> less icache
pollution.  With the 4-way unrolled loop and the short-loop intro, the
whole function fits in 118 bytes, or two icache lines, which seems
reasonable.
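The structure described above (a rolled intro loop that shifts one limb at a time until the remaining count is a multiple of 4, then a 4-way unrolled body, then a plain shift for the top limb) can be modelled in C.  This is a sketch of the approach, not the actual asm; `rshift_model` and its argument order are my own, loosely following mpn_rshift's (dst, src, n, cnt), and it assumes 1 <= cnt <= 63.

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* C model of the intro-loop approach: each rp[i] line below stands
 * in for one shrd instruction.  Returns the bits shifted out of the
 * low end, as mpn_rshift does. */
static mp_limb_t rshift_model(mp_limb_t *rp, const mp_limb_t *sp,
                              long n, unsigned cnt)
{
    mp_limb_t ret = sp[0] << (64 - cnt);  /* bits shifted out the bottom */
    long i = 0, m = n - 1;                /* m = number of shrd-style steps */

    while (m & 3) {                       /* rolled intro: until m ≡ 0 (mod 4) */
        rp[i] = (sp[i] >> cnt) | (sp[i + 1] << (64 - cnt));
        i++; m--;
    }
    while (m) {                           /* 4-way unrolled main loop */
        rp[i]     = (sp[i]     >> cnt) | (sp[i + 1] << (64 - cnt));
        rp[i + 1] = (sp[i + 1] >> cnt) | (sp[i + 2] << (64 - cnt));
        rp[i + 2] = (sp[i + 2] >> cnt) | (sp[i + 3] << (64 - cnt));
        rp[i + 3] = (sp[i + 3] >> cnt) | (sp[i + 4] << (64 - cnt));
        i += 4; m -= 4;
    }
    rp[i] = sp[i] >> cnt;                 /* top limb: plain shift */
    return ret;
}
```

Note that the intro doubles as the n==0 test the text mentions: for m already 0 mod 4 it falls straight through to the unrolled loop.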

With hot caches and branch predictor, I bet the fastest code on the
synthetic benchmark would be a test 1,%dl; jz into a two-way unrolled intro loop that ran until n mod whatever == 0, then enter the main loop at the top. Register allocation could probably be arranged so you don't need too many moves between loops. This might be worth it if we go with an 8-way unroll of the main loop. I did some timing with 8-way vs. 4-way unroll for small n[1], and the simple intro loop costs up to 8 cycles (depending on n mod 8 vs. whatever is good/bad for the alternatives). In real life, shift is probably mixed in with plenty of other code. For small n, trading more icache misses for maybe 5 cycles total doesn't seem worth it. For large n, the intro becomes a smaller and smaller fraction. > I then jump into the loop from the four cases. I avoid a special > loop, since that adds considerable overhead. (The overhead for a > short loop is typically large, partially due to branch prediction > problems.) re: branch prediction, and the sub-thread Niels started: >This requires a complex branch > predictor that can match a pattern of taken and non-taken (taken 5 > times, non-taken, taken 5 times, non-taken ...) Core 2's branch predictor detects patterns of length <= 16. That's why the optimization manuals suggest unrolling loops of known length until they need <= 16 iterations. Modern branch predictors are very fancy, and probably worth every transistor. :) This helps us only when rshift is called multiple times with args of the _same length_. I suppose this is the most common case, though. None of the branches are very far, and they're all in the same cache line (or to the next one, when jumping to the end of the function from the intro loop). So the mispredict penalty should be low, and the processor can easily speculate on the branch going the other way. It might be worth testing by putting some code outside the timing loop to overflow the branch-target buffer, to see how different intros do with a cold branch predictor. 
The problem is that I don't have any other intros that set up all the
limbs in the right registers for the pipelined loop.  And just putting
not-very-tight setup code into the computed-goto or decision-tree versions
doesn't seem fair.  I could try all three with my not-so-pipelined loop as
the main loop, since it would be easy to put my simple-loop intro in front
of it.  1.5 c/l vs. 1.3 c/l won't be a big deal for n=1..10, compared to
mispredicted branches.

The Intel manual claims Core 2 dynamically predicts all branches, even if
the BP unit hasn't seen them before.  It says the BPU looks ahead further
than the rest of the ifetch hardware, so it can see branches coming up
before it gets to them.  I was never clear on exactly what that meant...

> Whether to put the n mod 4 code before or after the unrolled loop is
> another question.  One may avoid the n mod X computation by doing the
> odd stuff after the unrolled loop; this is particularly important if
> the unrolling factor is not a power of 2.
>
> If the unrolling factor is greater than 4, a computed goto or a branch
> table is usually fastest.  But one should then perhaps avoid the
> branch table for small n, to avoid its overhead.
>
> It also makes it easy to try different loop unrolling, because the loop
> only has to be entered in one place.

Unrolling to 8 limbs/iter can hit 1.20 c/l.  It requires the loop body to
be ALIGN(16)ed, though.  Without that, it's slower than 4 limbs/iter.

> Perhaps loop aligning becomes more critical when the loop > 64 bytes?

See the Intel optimization manual, sections 2.1.2.2 (Instruction Fetch
Unit) and 2.1.2.3 (Instruction Queue).  If the jump to the top of the loop
is to an address that is 12 mod 16, for example, and there is only 1
instruction before the 16-byte boundary, then the instruction pre-decode
stage will deliver only one instruction to the instruction queue in that
clock cycle.  The instructions used in the loop are mostly 4 bytes long
(so 4 fit into a 16-byte fetch).
Right after that, the manual says this about taken branches: "A taken
branch reduces the number of instruction bytes delivered to the decoders
since the bytes after the taken branch are not decoded."

I think when I did the timing that suggested alignment helped a lot, I
hadn't tightened up the intro as much.  The loop started a decent way into
a 16-byte fetch, _and_ the tail of the loop just barely stuck into the
first couple bytes of a 16-byte fetch.  That meant there were two bubbles
together (or one big bubble) in the instruction stream.  I guess the
instruction fetch/predecode doesn't run far enough ahead for the CPU to
soak up a bubble like that in our loop, which runs at ~3 IPC.

With the current intro, the loop starts nearly aligned.  The 8-way
unrolled version is just as fast with or without the loop start exactly
aligned.  So there's no magic about exact alignment, just bubbles in the
instruction stream.

The Loop Stream Detector replays instructions from the 18-entry IQueue;
see section 3.4.2.4 (Optimizing the Loop Stream Detector).  It kicks in
iff the loop is <= four 16-byte fetches, <= 18 instructions, and has <=
four taken branches (none of them a RET :P).  Instructions are sent to the
decoders at up to 5/clock (presumably only 4/clock without macro-fusion).
If the loop needs alignment to fit in four 16-byte fetches, then aligning
it will probably help a lot.  Adding a couple of nops to the end of the
4-way unrolled loop to push it up to five 16-byte fetches took it from
1.30 c/l to 1.57 c/l.  (It created a nasty bubble at the end of the loop.)

The optimization manual recommends unrolling important loops anyway.  In
our case, the perf gain from more unrolling is probably going to be pretty
small.  Unrolling to 8-way without doing any more pipelining, SIZE=1001
gives maybe 1.249 c/l, where 4-way was 1.297.  Once you're into L2 instead
of L1, the extra unrolling makes no difference.
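The Loop Stream Detector conditions listed above can be written down as a little predicate.  This is just a restatement of the constraints as summarized in this mail; the struct and function names are made up for illustration.

```c
#include <stdbool.h>

/* Core 2 LSD eligibility, per the conditions summarized above:
 * <= four 16-byte fetches, <= 18 instructions, <= four taken
 * branches, and no RET among them. */
struct loop_info {
    int fetch_blocks;    /* 16-byte fetch lines the loop spans */
    int instructions;    /* instruction count in the loop body */
    int taken_branches;  /* taken branches, incl. the loop-back jcc */
    bool contains_ret;
};

static bool lsd_eligible(const struct loop_info *l)
{
    return l->fetch_blocks <= 4
        && l->instructions <= 18
        && l->taken_branches <= 4
        && !l->contains_ret;
}
```

On this model, padding the 4-way loop out to five 16-byte fetches (the nop experiment above) is exactly what disqualifies it.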
The only time it would be worth 8-way unrolling would be for an
application that was dominated by shifts of data in the L1 cache.

>    Pipelining it deeper might get it down even closer to 1 c/l, even
>    without unrolling.
>
> I doubt it.  Remember Core2 is a 3-way superscalar machine.

Yeah.

> (I think
> the insn fusion feature does not work for 64-bit instructions; if I am
> right we cannot expect 4-way issue at any time.)

Right.  Macro-fusion (cmp/test + jcc -> 1 uop) only happens for IA32
code.  uop fusion (op w/ memory operand -> only 1 uop instead of a load
and an ALU op) happens in IA32 and AMD64 mode.  It took me a while to get
that straight, and to keep track of which kind of fusion was which.
Oprofile's UOPS_RETIRED can even count "fused store address + data
retired", so I guess stores take 1 uop and that counts as fusion.  Maybe
the memory operand / ALU op was just an example in Intel's optimization
manual.

Anyway, point taken; the 4-way unrolled loop is probably sustaining 3
uops/clock.  The loop overhead is 2 lea, 1 sub, and a jcc, which is
probably 4 uops.

  (3 uops/limb * 4 limbs + 4 uops) * (1/3 clocks/uop) * (1/4 iter/limb)
    = 16 / 3 / 4 clocks/limb = 1.333 c/l

Hmm, unless the timing is off (which is not unlikely, since I'm still
using shift.c...), we're seeing 1.30-1.31 c/l for the 4-way unrolled loop.
That's faster than 3 uops/clock allows, so you might be mistaken about the
3-wide issue width.  Core 2 has 4 decoders.  It can issue up to 6
uops/clock (some of those would have to be FP/SIMD, though).  There are
three issue ports for integer ALU ops, and one for integer loads.  Stores
go through two issue ports at once, store address and store data.  So it
looks like Core 2 could issue a shift, a load, and a store in one cycle,
and have room left for another ALU op.  It also has the retirement
bandwidth: "Each cycle, up to 4 results may be written back to the RS and
ROB, to be used as early as the next cycle by the RS.
This high execution bandwidth enables execution bursts to keep up with the
functional expansion of the micro-fused uops that are decoded and
retired."

With oprofile, shift.c with SIZE=1001, measuring UOPS_RETIRED (micro-ops)
and CPU_CLK_UNHALTED (core cycles), both with count 2000000 (so we can
compare the numbers directly): the 4-way unrolled loop had a retired-uop
count of 36127 and a clock count of 9264 overall.  That's 3.90 uops/clock,
and that counts fused uops as 1 uop.  The 8-way unrolled loop did
32544/9217 = 3.53 uops/clock.

So there's not much room for improvement in the 4-way.  I'm going to try
to pipeline it more.

I searched for a while yesterday, but I can't find any information
anywhere on how many uops a given instruction decodes to.  Intel's
optimization manual sometimes says "this insn is 2 uops on Pentium M, but
only 1 on Core architecture...", but nothing that doesn't seem to assume
you already know the basics.  Except for a statement that most simple ALU
ops are 1 uop.  Great.  For a while I thought addressing modes might
generate uops to calculate addresses, but I see that there are AGUs with
adders, so that's not the case.

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out
 how to distinguish the hours!  Confound him, too,
 who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

[1] 8-way:
-test $3, %dl
+test $7, %dl
and duplicate the loop body, adjusting addr offsets.

for i in {1..20}; do \rm shift.o; make CC="gcc -g -Wall -static -DSIZE=$i" shift > /dev/null && sudo nice --5 ./shift 2400000000 2; done
(sudo nice --5 to make sure it gets a core to itself, since I have some java
and other crap running in the background...  These times are after cpufreq
decides to switch the core to full speed.)
SIZE=1; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1500ms (12.000 cycles/limb)
mpn_rshift_core2:     1504ms (12.032 cycles/limb)
SIZE=2; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1172ms (7.032 cycles/limb)
mpn_rshift_core2:     1168ms (7.008 cycles/limb)
SIZE=3; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1068ms (5.696 cycles/limb)
mpn_rshift_core2:     1056ms (5.632 cycles/limb)
SIZE=4; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1000ms (5.000 cycles/limb)
mpn_rshift_core2:     1000ms (5.000 cycles/limb)
SIZE=5; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      992ms (4.762 cycles/limb)
mpn_rshift_core2:     1004ms (4.819 cycles/limb)
SIZE=6; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      968ms (4.517 cycles/limb)
mpn_rshift_core2:      960ms (4.480 cycles/limb)
SIZE=7; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      940ms (4.297 cycles/limb)
mpn_rshift_core2:      940ms (4.297 cycles/limb)
SIZE=8; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      916ms (4.122 cycles/limb)
mpn_rshift_core2:      916ms (4.122 cycles/limb)
SIZE=9; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      504ms (2.240 cycles/limb)
mpn_rshift_core2:      500ms (2.222 cycles/limb)
SIZE=10; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      504ms (2.218 cycles/limb)
mpn_rshift_core2:      500ms (2.200 cycles/limb)
SIZE=11; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      568ms (2.479 cycles/limb)
mpn_rshift_core2:      564ms (2.461 cycles/limb)
SIZE=12; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      552ms (2.392 cycles/limb)
mpn_rshift_core2:      556ms (2.409 cycles/limb)
SIZE=13; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      612ms (2.636 cycles/limb)
mpn_rshift_core2:      616ms (2.654 cycles/limb)
SIZE=14; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      624ms (2.674 cycles/limb)
mpn_rshift_core2:      612ms (2.623 cycles/limb)
SIZE=15; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      640ms (2.731 cycles/limb)
mpn_rshift_core2:      644ms (2.748 cycles/limb)
SIZE=16; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      652ms (2.771 cycles/limb)
mpn_rshift_core2:      652ms (2.771 cycles/limb)
SIZE=17; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      408ms (1.728 cycles/limb)
mpn_rshift_core2:      412ms (1.745 cycles/limb)
SIZE=18; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      432ms (1.824 cycles/limb)
mpn_rshift_core2:      436ms (1.841 cycles/limb)
SIZE=19; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      460ms (1.937 cycles/limb)
mpn_rshift_core2:      460ms (1.937 cycles/limb)
SIZE=20; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      488ms (2.050 cycles/limb)
mpn_rshift_core2:      488ms (2.050 cycles/limb)

So compared to the 4-way unroll (times in the comments of my last email),
it's pretty bad, esp. at size 8.  It's maybe 8 cycles slower total (about
32 instead of 24).  The c/l differences start to matter as n grows larger,
esp. when you see c/l increasing with increasing n.
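For what it's worth, the numbers above give a rough break-even point between the two unrolls: the 8-way version costs ~8 extra cycles of intro overhead at small n, but saves about 1.297 - 1.249 = 0.048 c/l in L1 steady state.  A quick sketch of that trade-off (my own back-of-envelope helper, using only the figures quoted in this mail):

```c
/* Break-even size for 4-way vs. 8-way unrolling: n at which the
 * steady-state c/l saving pays back the extra intro overhead.
 * Inputs are the measurements quoted above, not a general model. */
static double crossover_limbs(double extra_overhead_cycles,
                              double cpl_4way, double cpl_8way)
{
    return extra_overhead_cycles / (cpl_4way - cpl_8way);
}
```

With 8 cycles, 1.297 c/l, and 1.249 c/l, the crossover lands around n = 167 limbs, so the 8-way unroll only wins for fairly large operands that still fit in L1.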