rshift for 64bit Core2

Tue Mar 25 11:13:48 CET 2008

 I re-worked the cleanup code to do the leftover limbs after the
unrolled loop.  Now it doesn't segfault under electric fence, or read
past the end of src at all.  It's faster for small n, too.

 I also parameterized things with m4.  I can switch between 4-limb and
8-limb unrolling by uncommenting an m4 define.  And I can move the lea
instructions up and down in the loop easily.  I'd never used m4
before, but it's not bad. :)

 code in my darcs repo.  I'm still tweaking things, so I won't copy a
big chunk into this email.  The code in the repo (as of this writing)
works great, but I'm working on what order to put the blocks in for
good locality with small n.  e.g. c2_unrolled_cleanup can go after the
jmp, before c2_shortloop.  It's tricky to avoid stalling instruction
fetch with a badly placed branch, or branch target.

On Mon, Mar 24, 2008 at 03:10:45PM +0100, Torbjorn Granlund wrote:
> If we're really crazy, we should use separate code for the L2 miss
> case and for the memory case.

 nice. :^)

> DDR memory does not like to interleave reads and writes, it prefers to
> do either of them.  But the GMP loops tend to interleave reads and
> writes in a DDR-pessimal way.
> 
> The way around this is prefetching read operands into L1 cache,
> by an explicit loop:

 SSE2 provides a few flavours of movntxx (nt = non-temporal = bypass
cache).  You can also prefetch to L1 but not L2 (prefetchnta), which
would be what we want here.  But we want to actually wait for the
fetch before doing our writes.  I don't see a way to fetch to L1 only
and then wait for it. :(

 You'd want to load some, and issue prefetches for more, then start
computing.  Otherwise your leaving the bus idle while you work.
Unless the hardware prefetcher knows to keep going far ahead...

 Are DRAM pages the same size a 4kB CPU pages?  If so, it might be
reasonable to read a memory page and prefetch the first line of the
next page.  Opening and closing DRAM pages is time consuming, right?
After issuing the prefetch, start working on the page you just read,
writing another complete page.  Or two incomplete pages, if the source
and dest aren't aligned the same way mod 4kB.

 You could use the sfence instruction to make sure all stores happen
before loads start on the next page.

> Some processors will always first read a written cache line from
> memory upon miss, which makes some sense as the processor cannot
> foresee whether the entire cache line will be overwritten.  If this
> happens in the code above, we might still get interleaving of reads
> and writes to memory, and also 33% of all memory operations will be
> useless reads.  One would like a mechanism for write-allocating a
> cache line (i.e., put it in the `E' state [E of the MESI cache
> states]).  Some processors allow for this, but I have yet to see a
> well-designed such interface.

 Isn't that what SSE2's "streaming stores" are for?  The CPU has some
write-combining store buffers.  There's even a  movnti r64, m64, so
you don't have to put things into xmm regs first.  This writes around
the cache, so conflicts with the prefetch strategy I suggested.  OTOH,
it's more likely to work!

> The P6 family tends to get unstable speed because of exactly that; if
> you stall in one place, more values are committed to the register file
> in the meantime, which then cause more ROB stalls later on.

 Does the CPU tend to get stuck in a steady-state slower-than-optimal
speed for many loop iterations?  (e.g. until the next interrupt?)
With profiling, you can't tell if the ROB read port pileups happened
on occasional loop iterations over the whole program execution, of if
one invocation of the function ran a lot slower and generated a big
chunk of pileups at once.  I guess with profiling enabled, generating
an interrupt will kick the processor out of the loop.

 So do you mean unstable wrt. multiple runs of the same code, or
unstable wrt. instruction ordering.

> 1.1663 is a nice improvement!

 Thanks.

>  My code measures at 1.27 c/l, with
> speed, forcing speed to measure more accurately (with -p10000000).
> You might want to use -p10000000 too.

 I've tried various combinations of sudo nice --7, and -p.  With -p
too large, speed uses up it's time slice and gets rescheduled.  To
give a CPU exclusively to speed, I do this:

make speed-ext; timeout -15 5 taskset 2 burnK7; for i in {1..3};do GMP_CPU_FREQUENCY=2400e6 \time sudo nice --7 taskset 2 chrt -f 1 timeout -15 20 ./speed-ext -p 5000000 -C -s 1-30,495-505,1000-1008 rshift.8 rshift.8  ;done
# DONT RUN THIS on a uniproc machine.  chrt -f = FIFO realtime prio!!!
# taskset runs it on the second CPU.  check /proc/interrupts to see
# what CPU handles which interrupt.  On my system, the first CPU
# handles all the interrupts.  Binding this to the first CPU disables
# the keyboard until it's done.  note that timeout runs at realtime
# prio, too, so it can send signals.  So you wouldn't have to press
# the reset button on a UP machine.

I put speed-ext.c in my darcs repository, with explanation of that
command line in the comments.
http://cordes.ca/cgi-bin/darcs.cgi/gmp-shift/?c=annotate&p=20080325024627-0feb0-2eeb25effc0a5e1fad6c95c25c8d8f2b9c36dee0.gz
http://cordes.ca/repos/gmp-shift/speed-ext.c
  GMP_CPU_FREQUENCY is needed because
speed.c gets the CPU clock from the first CPU in /proc/cpuinfo.
x86info has a good MHz estimate function:
http://git.choralone.org/?p=x86info.git;a=blob;f=bench/MHz.c;h=5cba53a9f69ae737458a780004b5a9d1d453cebf;hb=7967b164c6b420ef15f4731301f3b0e651bb46f5
it uses gettimeofday and rdtsc to time a usleep(250000), and divides
seconds by cycles.  On recent CPUs (with constant_tsc in
/proc/interrupts) it will always get the max speed, not the current
real speed, though.

> My code is 259 bytes, including optimized feed-in code and wind-down
> code.  I suspect it will be difficult to make smaller than that with
> 8-way unrolling, unless you sacrifice feed-in efficiency.  But I'd
> love to be proven wrong.

 Mine is 220B, but it could be better optimized.  I have a 1-limb loop
doing cleanup.  With unroll=8, it has to run up to 7 times.  Plotting
cycles (not c/l) vs. n shows a rising sawtooth...  (nice job with
speed.c, BTW.  quickplot is a great for looking at data.)

 My code has a lot of branch instructions, but some of them aren't on
the paths taken for n <= 4.  Also I haven't really counted how many
taken conditional branches there are for different n.  Non-taken
branches don't fill up the brach target buffer.

 See http://cordes.ca/repos/gmp-shift/rshift.asm  and search for
"unroll=8 pipedepth=3".  I made a table of what path the code's
supposed to take for each n up to 16.

 The optimization manual says unconditional branches on core2 are
essentially free (except for the instruction stream disruption).  I'm
not sure if they pollute the branch predictor at all, though.  I hope
not, because I have a couple of them, and I think they're better than
duplicating code.

> There is often a tradeoff between
> code size and small operands overhead; we should give reducing small
> operands overhead priority.

 Ok.  Minimal overhead in the already-cached repeated-calls case, I
assume.  I'll still try to minimize e.g. branch predictor resources I
need to use up for it to run fast.

> How's your (fully functional) code behaving for small operands?  Here
> are tune/speed's measurements for small sizes for my code:

I merged your results with mine for easier comparison.  This is for
8-limb unrolled, with the loop ALIGN(16)ed.  The loop uses only 4
regs, so I don't need to push/pop any callee-saved regs.  I moved the
stores ahead of the loads so I could reuse regs in the loop.  I found
places to put the lea/sub instructions that make it still run fast on
large n.

 Looking at cycles, not c/l, seems a lot more useful when comparing
how good the different paths taken for small n are.  (Actually,
running speed-ext at realtime FIFO priority on a core that only
handles its timer interrupt means I often get results reproducible
to the cycle for at least n=1..1000.  I haven't been testing very
large n much recently.)

 My 4-limb unrolled loop has better small-n performance, because the
cleanup loop doesn't run as much.  But it can't match 8-limb unrolling
for larger n.  I should have copied it in as another column, but I
forgot, and it's dawn already. :P

 If the two columns don't say the same thing, it's because they were
from different runs.  None of the numbers looks out of place to me,
though.  I really do see times like 1.15 c/l :)

	overhead 6.00 cycles, precision 5000000 units of 4.17e-10 secs, CPU freq 2400.00 MHz	6.00
mpn_lshift.1	     rshift.8	rshift.8(cycles)
11.0099	1              9.0005	9.00
7.0064	2              6.0003	12.00
5.3467	3              5.0002	15.00
3.2582	4              4.0002	16.00
3.2029	5              3.8002	19.00
2.8359	6              3.6668	22.00
2.5739	7              3.5716	25.00
2.2527	8              3.5002	28.00
2.3356	9              3.4446	31.00
2.2022	10             3.4002	34.00
2.1839	11             3.3638	37.00
1.9203	12             1.8334	22.00
2.3115	13             1.9232	25.00
2.2263	14             2.0004	28.00
1.9377	15             2.0668	31.00
1.7613	16             2.1252	34.00
1.8290	17             2.1766	37.00
1.8616	18             2.2223	40.00
1.8361	19             2.2633	43.00
1.6721	20             1.6003	32.00
	500            1.1471	573.05
	501            1.1498	575.73
	502            1.1527	578.11
	505            1.1626	587.19
	1000           1.1622	1163.84
	1001           1.2159	1165.18
	1002           1.1653	1167.66

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter at cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC