# rshift for 64bit Core2

Peter Cordes peter at cordes.ca
Sat Mar 22 23:05:03 CET 2008

On Sat, Mar 22, 2008 at 02:46:20PM +0100, Torbjorn Granlund wrote:
> Peter Cordes <peter at cordes.ca> writes:
>
>    Thanks for the starting point.  It only goes 1.33c/l for me, though.
>   (counting overhead, so maybe it's really 1.25).  I haven't profiled it yet,
>   either.
>
> Yes, 1.25 seems a bit optimistic, remeasuring it now yields 1.30 c/l
> just before it starts missing L1 cache.

I sometimes get 1.281 c/l (at SIZE=1500).  But I'm still using shift.c, and
the measurement interval isn't very long when it's that fast.

Yeah, with SIZE=1001, it seems more and more like I get 1.281-1.297.  (1001
is optimal for the intro: the setup loop does 0 shrd insns.)

>    Instead of trying to set up all the registers for a computed jump into the
>   middle of the loop, I just used a rolled-up loop that goes until n&3 == 0,
>   at which point we're ready to enter the unrolled loop.  It's not the fastest
>   possible for small n, but it's pretty good and doesn't fill up the Icache.
>   It could also be used as a cleanup loop at the end, since it already tests
>   for n==0 as an exit condition.
>
> You'll find many variants for handling this in existing GMP assembly
> code.  For 4-way unrolling, I prefer a little decision tree, comparing
> (n mod 4) to 2 first, which sets condition code for deciding if n mod
> 4 is 3, 2, or smaller than 1.

I already came up with that myself. :)  That's the triple-jcc column on the
benchmark numbers (with my not-so-well-pipelined loop).  It did worse than
the intro-loop in nearly every case.  I think it would do even worse on the
pipelined loop that needs more setup (limbs in registers).  IIRC, Core 2
doesn't really like branch instructions with nothing in between.
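The decision tree Torbjorn describes could be sketched like this (plain C, purely hypothetical control flow, not the asm; `unroll_entry` is a made-up name): one comparison of (n mod 4) against 2 feeds both conditional branches, picking among the four entry points.

```c
/* Hypothetical sketch of the decision-tree dispatch: compare
 * (n mod 4) against 2 once, then branch twice off that result,
 * the way a single cmp can feed two conditional jumps. */
static int unroll_entry(unsigned long n)
{
    unsigned long r = n & 3;        /* n mod 4 */
    if (r >= 2)                     /* cmp $2, r */
        return (r == 3) ? 3 : 2;    /* one jcc splits 3 from 2 */
    else
        return (r == 1) ? 1 : 0;    /* the other splits 1 from 0 */
}
```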

The intro loop doesn't just figure out where to jump into the main loop;
it starts touching the source data pretty early, which gets the hardware
prefetcher going.  I think for the shift functions it's going to be hard
to beat, esp. considering the advantage of small size -> less icache
pollution.  With the 4-way unrolled loop and the short-loop intro, the
whole function fits in 118 bytes, or two icache lines, which seems
reasonable.
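The structure described above (a rolled intro loop that shifts one limb at a time until the remaining count is a multiple of 4, then a 4-way unrolled body, then a plain shift for the top limb) can be modelled in C.  This is a sketch of the approach, not the actual asm; `rshift_model` and its argument order are my own, loosely following mpn_rshift's (dst, src, n, cnt), and it assumes 1 <= cnt <= 63.

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* C model of the intro-loop approach: each rp[i] line below stands
 * in for one shrd instruction.  Returns the bits shifted out of the
 * low end, as mpn_rshift does. */
static mp_limb_t rshift_model(mp_limb_t *rp, const mp_limb_t *sp,
                              long n, unsigned cnt)
{
    mp_limb_t ret = sp[0] << (64 - cnt);  /* bits shifted out the bottom */
    long i = 0, m = n - 1;                /* m = number of shrd-style steps */

    while (m & 3) {                       /* rolled intro: until m ≡ 0 (mod 4) */
        rp[i] = (sp[i] >> cnt) | (sp[i + 1] << (64 - cnt));
        i++; m--;
    }
    while (m) {                           /* 4-way unrolled main loop */
        rp[i]     = (sp[i]     >> cnt) | (sp[i + 1] << (64 - cnt));
        rp[i + 1] = (sp[i + 1] >> cnt) | (sp[i + 2] << (64 - cnt));
        rp[i + 2] = (sp[i + 2] >> cnt) | (sp[i + 3] << (64 - cnt));
        rp[i + 3] = (sp[i + 3] >> cnt) | (sp[i + 4] << (64 - cnt));
        i += 4; m -= 4;
    }
    rp[i] = sp[i] >> cnt;                 /* top limb: plain shift */
    return ret;
}
```

Note that the intro doubles as the n==0 test the text mentions: for m already 0 mod 4 it falls straight through to the unrolled loop.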

With hot caches and branch predictor, I bet the fastest code on the
synthetic benchmark would be a test 1,%dl; jz into a two-way unrolled intro loop that ran until n mod whatever == 0, then enter the main loop at the top. Register allocation could probably be arranged so you don't need too many moves between loops. This might be worth it if we go with an 8-way unroll of the main loop. I did some timing with 8-way vs. 4-way unroll for small n[1], and the simple intro loop costs up to 8 cycles (depending on n mod 8 vs. whatever is good/bad for the alternatives). In real life, shift is probably mixed in with plenty of other code. For small n, trading more icache misses for maybe 5 cycles total doesn't seem worth it. For large n, the intro becomes a smaller and smaller fraction. > I then jump into the loop from the four cases. I avoid a special > loop, since that adds considerable overhead. (The overhead for a > short loop is typically large, partially due to branch prediction > problems.) re: branch prediction, and the sub-thread Niels started: >This requires a complex branch > predictor that can match a pattern of taken and non-taken (taken 5 > times, non-taken, taken 5 times, non-taken ...) Core 2's branch predictor detects patterns of length <= 16. That's why the optimization manuals suggest unrolling loops of known length until they need <= 16 iterations. Modern branch predictors are very fancy, and probably worth every transistor. :) This helps us only when rshift is called multiple times with args of the _same length_. I suppose this is the most common case, though. None of the branches are very far, and they're all in the same cache line (or to the next one, when jumping to the end of the function from the intro loop). So the mispredict penalty should be low, and the processor can easily speculate on the branch going the other way. It might be worth testing by putting some code outside the timing loop to overflow the branch-target buffer, to see how different intros do with a cold branch predictor. 
The problem is that I don't have any other intros that set up all the
limbs in the right registers for the pipelined loop.  And just putting
not-very-tight setup code into the computed-goto or decision-tree versions
doesn't seem fair.  I could try all three with my not-so-pipelined loop as
the main loop, since it would be easy to put my simple-loop intro in front
of it.  1.5 c/l vs. 1.3 c/l won't be a big deal for n=1..10, compared to
mispredicted branches.

The Intel manual claims Core 2 dynamically predicts all branches, even if
the BP unit hasn't seen them before.  It says the BPU looks ahead further
than the rest of the ifetch hardware, so it can see branches coming up
before it gets to them.  I was never clear on exactly what that meant...

> Whether to put the n mod 4 code before or after the unrolled loop is
> another question.  One may avoid the n mod X computation by doing the
> odd stuff after the unrolled loop; this is particularly important if
> the unrolling factor is not a power of 2.
>
> If the unrolling factor is greater than 4, a computed goto or a branch
> table is usually fastest.  But one should then perhaps avoid the
> branch table for small n, to avoid its overhead.
>
> It also makes it easy to try different loop unrolling, because the loop
> only has to be entered in one place.

Unrolling to 8 limbs/iter can hit 1.20 c/l.  It requires the loop body to
be ALIGN(16)ed, though.  Without that, it's slower than 4 limbs/iter.

> Perhaps loop aligning becomes more critical when the loop > 64 bytes?

See the Intel optimization manual, sections 2.1.2.2 (Instruction Fetch
Unit) and 2.1.2.3 (Instruction Queue).  If the jump to the top of the loop
is to an address that is 12 mod 16, for example, and there is only 1
instruction before the 16-byte boundary, then the instruction pre-decode
stage will deliver only one instruction to the instruction queue in that
clock cycle.  The instructions used in the loop are mostly 4 bytes long
(so 4 fit into a 16-byte fetch).
Right after that, the manual says this about taken branches: "A taken
branch reduces the number of instruction bytes delivered to the decoders
since the bytes after the taken branch are not decoded."

I think when I did the timing that suggested alignment helped a lot, I
hadn't tightened up the intro as much.  The loop started a decent way into
a 16-byte fetch, _and_ the tail of the loop just barely stuck into the
first couple bytes of a 16-byte fetch.  That meant there were two bubbles
together (or one big bubble) in the instruction stream.  I guess the
instruction fetch/predecode doesn't run far enough ahead for the CPU to
soak up a bubble like that in our loop, which runs at ~3 IPC.

With the current intro, the loop starts nearly aligned.  The 8-way
unrolled version is just as fast with or without the loop start exactly
aligned.  So there's no magic about exact alignment, just bubbles in the
instruction stream.

The Loop Stream Detector replays instructions from the 18-entry IQueue;
see section 3.4.2.4 (Optimizing the Loop Stream Detector).  It kicks in
iff the loop is <= four 16-byte fetches, <= 18 instructions, and has <=
four taken branches (none of them a RET :P).  Instructions are sent to the
decoders at up to 5/clock (presumably only 4/clock without macro-fusion).
If the loop needs alignment to fit in four 16-byte fetches, then aligning
it will probably help a lot.  Adding a couple of nops to the end of the
4-way unrolled loop to push it up to five 16-byte fetches took it from
1.30 c/l to 1.57 c/l.  (It created a nasty bubble at the end of the loop.)

The optimization manual recommends unrolling important loops anyway.  In
our case, the perf gain from more unrolling is probably going to be pretty
small.  Unrolling to 8-way without doing any more pipelining, SIZE=1001
gives maybe 1.249 c/l, where 4-way was 1.297.  Once you're into L2 instead
of L1, the extra unrolling makes no difference.
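The Loop Stream Detector conditions listed above can be written down as a little predicate.  This is just a restatement of the constraints as summarized in this mail; the struct and function names are made up for illustration.

```c
#include <stdbool.h>

/* Core 2 LSD eligibility, per the conditions summarized above:
 * <= four 16-byte fetches, <= 18 instructions, <= four taken
 * branches, and no RET among them. */
struct loop_info {
    int fetch_blocks;    /* 16-byte fetch lines the loop spans */
    int instructions;    /* instruction count in the loop body */
    int taken_branches;  /* taken branches, incl. the loop-back jcc */
    bool contains_ret;
};

static bool lsd_eligible(const struct loop_info *l)
{
    return l->fetch_blocks <= 4
        && l->instructions <= 18
        && l->taken_branches <= 4
        && !l->contains_ret;
}
```

On this model, padding the 4-way loop out to five 16-byte fetches (the nop experiment above) is exactly what disqualifies it.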
The only time it would be worth 8-way unrolling would be for an
application that was dominated by shifts of data in the L1 cache.

>    Pipelining it deeper might get it down even closer to 1 c/l, even
>    without unrolling.
>
> I doubt it.  Remember Core2 is a 3-way superscalar machine.

Yeah.

> (I think
> the insn fusion feature does not work for 64-bit instructions; if I am
> right we cannot expect 4-way issue at any time.)

Right.  Macro-fusion (cmp/test + jcc -> 1 uop) only happens for IA32
code.  uop fusion (op w/ memory operand -> only 1 uop instead of a load
and an ALU op) happens in IA32 and AMD64 mode.  It took me a while to get
that straight, and to keep track of which kind of fusion was which.
Oprofile's UOPS_RETIRED can even count "fused store address + data
retired", so I guess stores take 1 uop and that counts as fusion.  Maybe
the memory operand / ALU op was just an example in Intel's optimization
manual.

Anyway, point taken; the 4-way unrolled loop is probably sustaining 3
uops/clock.  The loop overhead is 2 lea, 1 sub, and a jcc, which is
probably 4 uops.

  (3 uops/limb * 4 limbs + 4 uops) * (1/3 clocks/uop) * (1/4 iter/limb)
    = 16 / 3 / 4 clocks/limb = 1.333 c/l

Hmm, unless the timing is off (which is not unlikely, since I'm still
using shift.c...), we're seeing 1.30-1.31 c/l for the 4-way unrolled loop.
That's faster than 3 uops/clock allows, so you might be mistaken about the
3-wide issue width.  Core 2 has 4 decoders.  It can issue up to 6
uops/clock (some of those would have to be FP/SIMD, though).  There are
three issue ports for integer ALU ops, and one for integer loads.  Stores
go through two issue ports at once, store address and store data.  So it
looks like Core 2 could issue a shift, a load, and a store in one cycle,
and have room left for another ALU op.  It also has the retirement
bandwidth: "Each cycle, up to 4 results may be written back to the RS and
ROB, to be used as early as the next cycle by the RS.
This high execution bandwidth enables execution bursts to keep up with the
functional expansion of the micro-fused uops that are decoded and
retired."

With oprofile, shift.c with SIZE=1001, measuring UOPS_RETIRED (micro-ops)
and CPU_CLK_UNHALTED (core cycles), both with count 2000000 (so we can
compare the numbers directly): the 4-way unrolled loop had a retired-uop
count of 36127 and a clock count of 9264 overall.  That's 3.90 uops/clock,
and that counts fused uops as 1 uop.  The 8-way unrolled loop did
32544/9217 = 3.53 uops/clock.

So there's not much room for improvement in the 4-way.  I'm going to try
to pipeline it more.

I searched for a while yesterday, but I can't find any information
anywhere on how many uops a given instruction decodes to.  Intel's
optimization manual sometimes says "this insn is 2 uops on Pentium M, but
only 1 on Core architecture...", but nothing that doesn't seem to assume
you already know the basics.  Except for a statement that most simple ALU
ops are 1 uop.  Great.  For a while I thought addressing modes might
generate uops to calculate addresses, but I see that there are AGUs with
adders, so that's not the case.

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out
 how to distinguish the hours!  Confound him, too,
 who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

[1] 8-way:
-test $3, %dl
+test $7, %dl
and duplicate the loop body, adjusting addr offsets.

for i in {1..20}; do \rm shift.o; make CC="gcc -g -Wall -static -DSIZE=$i" shift > /dev/null && sudo nice --5 ./shift 2400000000 2; done
(sudo nice --5 to make sure it gets a core to itself, since I have some java
and other crap running in the background...  These times are after cpufreq
decides to switch the core to full speed.)
SIZE=1; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1500ms (12.000 cycles/limb)
mpn_rshift_core2:     1504ms (12.032 cycles/limb)
SIZE=2; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1172ms (7.032 cycles/limb)
mpn_rshift_core2:     1168ms (7.008 cycles/limb)
SIZE=3; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1068ms (5.696 cycles/limb)
mpn_rshift_core2:     1056ms (5.632 cycles/limb)
SIZE=4; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:     1000ms (5.000 cycles/limb)
mpn_rshift_core2:     1000ms (5.000 cycles/limb)
SIZE=5; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      992ms (4.762 cycles/limb)
mpn_rshift_core2:     1004ms (4.819 cycles/limb)
SIZE=6; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      968ms (4.517 cycles/limb)
mpn_rshift_core2:      960ms (4.480 cycles/limb)
SIZE=7; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      940ms (4.297 cycles/limb)
mpn_rshift_core2:      940ms (4.297 cycles/limb)
SIZE=8; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      916ms (4.122 cycles/limb)
mpn_rshift_core2:      916ms (4.122 cycles/limb)
SIZE=9; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      504ms (2.240 cycles/limb)
mpn_rshift_core2:      500ms (2.222 cycles/limb)
SIZE=10; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      504ms (2.218 cycles/limb)
mpn_rshift_core2:      500ms (2.200 cycles/limb)
SIZE=11; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      568ms (2.479 cycles/limb)
mpn_rshift_core2:      564ms (2.461 cycles/limb)
SIZE=12; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      552ms (2.392 cycles/limb)
mpn_rshift_core2:      556ms (2.409 cycles/limb)
SIZE=13; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      612ms (2.636 cycles/limb)
mpn_rshift_core2:      616ms (2.654 cycles/limb)
SIZE=14; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      624ms (2.674 cycles/limb)
mpn_rshift_core2:      612ms (2.623 cycles/limb)
SIZE=15; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      640ms (2.731 cycles/limb)
mpn_rshift_core2:      644ms (2.748 cycles/limb)
SIZE=16; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      652ms (2.771 cycles/limb)
mpn_rshift_core2:      652ms (2.771 cycles/limb)
SIZE=17; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      408ms (1.728 cycles/limb)
mpn_rshift_core2:      412ms (1.745 cycles/limb)
SIZE=18; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      432ms (1.824 cycles/limb)
mpn_rshift_core2:      436ms (1.841 cycles/limb)
SIZE=19; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      460ms (1.937 cycles/limb)
mpn_rshift_core2:      460ms (1.937 cycles/limb)
SIZE=20; alignment: s1&0xf = 0; (dy+1)&0xf = 0
mpn_rshift_core2:      488ms (2.050 cycles/limb)
mpn_rshift_core2:      488ms (2.050 cycles/limb)

So compared to the 4-way unroll (times in the comments of my last email),
it's pretty bad, esp. at size 8.  It's maybe 8 cycles slower total (about
32 instead of 24).  The c/l differences start to matter as n grows larger,
esp. when you see c/l increasing with increasing n.
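For what it's worth, the numbers above give a rough break-even point between the two unrolls: the 8-way version costs ~8 extra cycles of intro overhead at small n, but saves about 1.297 - 1.249 = 0.048 c/l in L1 steady state.  A quick sketch of that trade-off (my own back-of-envelope helper, using only the figures quoted in this mail):

```c
/* Break-even size for 4-way vs. 8-way unrolling: n at which the
 * steady-state c/l saving pays back the extra intro overhead.
 * Inputs are the measurements quoted above, not a general model. */
static double crossover_limbs(double extra_overhead_cycles,
                              double cpl_4way, double cpl_8way)
{
    return extra_overhead_cycles / (cpl_4way - cpl_8way);
}
```

With 8 cycles, 1.297 c/l, and 1.249 c/l, the crossover lands around n = 167 limbs, so the 8-way unroll only wins for fairly large operands that still fit in L1.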