rshift for 64bit Core2

Torbjorn Granlund tg at swox.com
Sat Mar 22 14:46:20 CET 2008


Peter Cordes <peter at cordes.ca> writes:

   Thanks for the starting point.  It only goes 1.33c/l for me, though.
  (counting overhead, so maybe it's really 1.25).  I haven't profiled it yet,
  either.
  
Yes, 1.25 seems a bit optimistic; remeasuring it now yields 1.30 c/l
just before it starts missing L1 cache.

   Instead of trying to set up all the registers for a computed jump into the
  middle of the loop, I just used a rolled-up loop that goes until n&3 == 0,
  at which point we're ready to enter the unrolled loop.  It's not the fastest
  possible for small n, but it's pretty good and doesn't fill up the Icache.
  It could also be used as a cleanup loop at the end, since it already tests
  for n==0 as an exit condition.

You'll find many variants for handling this in existing GMP assembly
code.  For 4-way unrolling, I prefer a little decision tree, comparing
(n mod 4) to 2 first, which sets the condition codes for deciding if n mod
4 is 3, 2, or smaller than 2.

I then jump into the loop from the four cases.  I avoid a special
loop, since that adds considerable overhead.  (The overhead for a
short loop is typically large, partially due to branch prediction
problems.)
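
A minimal C sketch of such a decision tree, for illustration only.  The
function name rshift_tree, the SHIFT_STEP macro and the labels e1..e4 are
hypothetical, and this is not GMP's mpn_rshift code; it assumes 64-bit limbs,
n >= 1 and 1 <= cnt <= 63.  The residue is compared to 2 first, and each of
the four cases jumps to its own entry point inside the 4-way unrolled loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One shifted limb: combine up[i] and up[i+1], then advance i. */
#define SHIFT_STEP do { rp[i] = (up[i] >> cnt) | (up[i + 1] << tnc); i++; } while (0)

/* Hypothetical helper: shift {up, n} right by cnt bits into {rp, n} and
   return the bits shifted out, left-justified. */
uint64_t rshift_tree(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    uint64_t retval = up[0] << tnc;
    size_t m = n - 1;        /* number of two-limb combinations */
    size_t i = 0;
    unsigned r = m & 3;

    /* Decision tree: the comparison against 2 separates 3, 2 and {0, 1}. */
    if (r < 2) {
        if (r == 1)
            goto e1;
        if (m == 0)
            goto done;       /* single limb, nothing to combine */
        goto e4;
    }
    if (r == 2)
        goto e2;
    goto e3;

    do {
e4:     SHIFT_STEP;
e3:     SHIFT_STEP;
e2:     SHIFT_STEP;
e1:     SHIFT_STEP;
    } while (i < m);

done:
    rp[m] = up[m] >> cnt;    /* the top limb has no upper neighbour */
    return retval;
}
```

In assembly the first comparison yields all three outcomes from one set of
flags; a C compiler is free to arrange the branches differently, so this only
shows the control flow, not the resulting timing.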

Whether to put the n mod 4 code before or after the unrolled loop is
another question.  One may avoid the n mod X computation by doing the
odd stuff after the unrolled loop; this is particularly important if
the unrolling factor is not a power of 2.
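
As an illustration of doing the odd limbs after the loop, here is a C sketch
with an unrolling factor of 3.  The name rshift_tail is made up for this
example, and it assumes 64-bit limbs, n >= 1 and 1 <= cnt <= 63.  The main
loop needs only a comparison against the remaining count, so n mod 3 is never
computed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift {up, n} right by cnt bits into {rp, n} and
   return the bits shifted out, left-justified. */
uint64_t rshift_tail(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    uint64_t retval = up[0] << tnc;
    size_t m = n - 1;        /* number of two-limb combinations */
    size_t i = 0;

    /* 3-way unrolled main loop: runs while at least 3 combinations remain. */
    for (; i + 3 <= m; i += 3) {
        rp[i]     = (up[i]     >> cnt) | (up[i + 1] << tnc);
        rp[i + 1] = (up[i + 1] >> cnt) | (up[i + 2] << tnc);
        rp[i + 2] = (up[i + 2] >> cnt) | (up[i + 3] << tnc);
    }

    /* Cleanup: at most 2 combinations are left over. */
    if (i < m) {
        rp[i] = (up[i] >> cnt) | (up[i + 1] << tnc);
        i++;
        if (i < m) {
            rp[i] = (up[i] >> cnt) | (up[i + 1] << tnc);
            i++;
        }
    }

    rp[m] = up[m] >> cnt;    /* the top limb has no upper neighbour */
    return retval;
}
```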

If the unrolling factor is greater than 4, a computed goto or a branch
table is usually fastest.  But one should then perhaps avoid the
branch table for small n, to avoid its overhead.
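
A rough C analogue of the branch-table approach for 8-way unrolling, using a
switch that jumps into the loop body (Duff's device style).  The name
rshift_table, the SHIFT_STEP macro and the cutoff of 8 are assumptions made
for this sketch, and whether the switch really becomes a jump table is up to
the compiler.  Small operands bypass the table and use a plain loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One shifted limb: combine up[i] and up[i+1], then advance i. */
#define SHIFT_STEP do { rp[i] = (up[i] >> cnt) | (up[i + 1] << tnc); i++; } while (0)

/* Hypothetical helper: shift {up, n} right by cnt bits into {rp, n} and
   return the bits shifted out, left-justified. */
uint64_t rshift_table(uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    uint64_t retval = up[0] << tnc;
    size_t m = n - 1;        /* number of two-limb combinations */
    size_t i = 0;

    if (m < 8) {
        /* Small operand: skip the table and its overhead. */
        while (i < m)
            SHIFT_STEP;
    } else {
        /* Enter the 8-way unrolled loop at the point given by m mod 8. */
        switch (m & 7) {
        case 0: do { SHIFT_STEP;
        case 7:      SHIFT_STEP;
        case 6:      SHIFT_STEP;
        case 5:      SHIFT_STEP;
        case 4:      SHIFT_STEP;
        case 3:      SHIFT_STEP;
        case 2:      SHIFT_STEP;
        case 1:      SHIFT_STEP;
                } while (i < m);
        }
    }

    rp[m] = up[m] >> cnt;    /* the top limb has no upper neighbour */
    return retval;
}
```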

   It also makes it easy to try different loop unrolling, because the loop
  only has to be entered in one place.  Unrolling to 8limbs/iter can hit 1.20c/l.
  It requires the loop body to be ALIGN(16)ed, though.  Without that, it's
  slower than 4limbs/iter.
  
Perhaps loop alignment becomes more critical when the loop is larger
than 64 bytes?

   Pipelining it deeper might get it down even closer to 1c/l, even without
  unrolling.

I doubt it.  Remember Core2 is a 3-way superscalar machine.  (I think
the insn fusion feature does not work for 64-bit instructions; if I am
right we cannot expect 4-way issue at any time.)

-- 
Torbjörn

