rshift for 64bit Core2

Mon Mar 24 15:10:45 CET 2008

Peter Cordes <peter at cordes.ca> writes:

   Using a longer software pipeline (code below) kills performance from L2.
  By the time the register will be used, it's cold.  SIZE=10001 gets 3.12c/l
  instead of 2.464c/l for the 4-limb loop I posted before, or for the below
  8-limb loop pipelined one less deep.  Main mem is still ~10.8c/l.  You get
  tons of ROB read port stalls, almost one/clock on some instructions, but
  it's still keeping up with main mem.

If we're really crazy, we should use separate code for the L2 miss
case and for the memory case.

DDR memory does not like to interleave reads and writes, it prefers to
do either of them.  But the GMP loops tend to interleave reads and
writes in a DDR-pessimal way.

The way around this is prefetching read operands into L1 cache,
by an explicit loop:

  chunk_size = L1size / BYTES_PER_MP_LIMB - something;
  for (...)
    {
      for (i = chunk_size - 1; i >= 0; i--)
        READ(ap[i]);

      for (i = 0; i < chunk_size; i++)
        SOME_GMP_OPERATION(rp[i], ap[i]);

      store_commit;

      if (less than chunk_size remains)
        chunk_size = remaining size;
    }

All memory reads will happen in the first loop, while all memory
writes will happen in the second loop (or in the store_commit).  One
will need special instructions for storing data, or a special
store_commit to make this happen.  Highly non-portable, but GMP can
deal with that thanks to its asm selection mechanism.

Some processors will always first read a written cache line from
memory upon miss, which makes some sense as the processor cannot
foresee whether the entire cache line will be overwritten.  If this
happens in the code above, we might still get interleaving of reads
and writes to memory, and also 33% of all memory operations will be
useless reads.  One would like a mechanism for write-allocating a
cache line (i.e., put it in the `E' state [E of the MESI cache
states]).  Some processors allow for this, but I have yet to see a
well-designed such interface.

PowerPC has some instructions intended for this, but they are cache
line size dependent.  We need a mechanism that works with current
32-byte and 64-byte lines, as well as with future larger cache lines.

What is there to win?  At 2.4 GHz and 10.8 cycles per 128 bits (64
bits read and 64 bits written) you're transferring only 3.56 GB/s,
whereas the bandwidth of your memory is probably > 10 GB/s.

I have no idea if a read/write loop can ever get close to the memory
bandwidth, my experiments some while ago with memcpy only reached
around half of the theoretical bandwidth.

   I commented it out and went back to 4-limb unrolled, then I had
  trouble getting the 8-limb loop back to its former speed.  It turns
  out it's sensitive to exactly where you mix in the lea instructions
  to increment the pointers.  (They can go anywhere, you just subtract
  64 from the offsets of anything you move them above.)  Moving things
  around can make it as bad a 1.4c/l, and it's pretty easy to get
  1.3c/l.  Usually there's a lot of ROB read port stall hits at one
  spot in the loop when it's not running at full speed.

The P6 family tends to get unstable speed because of exactly that; if
you stall in one place, more values are committed to the register file
in the meantime, which then cause more ROB stalls later on.  I hope
Intel will increase the number of read ports in the future.

   I now have a version[2] that hits 1.1663 c/l (with speed.c).  ROB
  read port stalls happen 1/100 clock cycles.  67300uop counts / 17319
  clock counts = 3.8859 uops/clock.  I attached the output of
  opannotate --assembly (note that only clocks and uops have the same
  counter; the others trigger much more often just to see them at
  all.)

1.1663 is a nice improvement!  My code measures at 1.27 c/l, with
speed, forcing speed to measure more accurately (with -p10000000).
You might want to use -p10000000 too.

   Whatever I do, this is going to need cleanup code at the end,
  because any pipelined loop will read off the end of src array unless
  it stops early.  So maybe I'll have a hard time fitting it in two
  cache lines with a 4-limb unroll after all.

My code is 259 bytes, including optimized feed-in code and wind-down
code.  I suspect it will be difficult to make smaller than that with
8-way unrolling, unless you sacrifice feed-in efficiency.  But I'd
love to be proven wrong.

Now, code size is important, but not critically so.  GMP has a fairly
small foot-print, and only a few percent of that foot print comes from
the critical assembly inner loops.  There is often a tradeoff between
code size and small operands overhead; we should give reducing small
operands overhead priority.

One may read the first operand unconditionally for most mpn functions,
since they are documented as not supporting n = 0.  This can be used
to shave off a cycle or two.

How's your (fully functional) code behaving for small operands?  Here
are tune/speed's measurements for small sizes for my code:

         mpn_lshift.1
1             11.0099
2              7.0064
3              5.3467
4              3.2582
5              3.2029
6              2.8359
7              2.5739
8              2.2527
9              2.3356
10             2.2022
11             2.1839
12             1.9203
13             2.3115
14             2.2263
15             1.9377
16             1.7613
17             1.8290
18             1.8616
19             1.8361
20             1.6721

-- 
Torbjörn