I've ported GMP to Mac Pro. GMPbench > 7700

Mon Oct 16 17:32:04 CEST 2006

"Jason Martin" <jason.worth.martin at gmail.com> writes:

  I played with my code some on an Opteron machine today and got it down
  to 2.5 cycles/limb.  There is a trade-off, of course.  For low limb
  counts (less than 16) the old code is still faster on the Opteron.  On
  a Woodcrest, however, my unrolled add_n beats the old code for all
  limb counts (which is extremely surprising considering I save off so
  many registers to the stack; it must be doing something clever to
  avoid read-after-write stalls).

The theoretical limit for add_n and sub_n is 1.5 cycles/limb for
Opteron, and 2.0 cycles/limb on Woodcrest (or Conroe).
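
For readers who do not have the operation in their head, here is a
minimal portable C sketch of what add_n computes, assuming 64-bit
limbs.  It is not GMP's code and says nothing about cycle counts; the
names limb_t and ref_add_n are made up for this illustration.  It only
shows the carry chain that the assembly loops have to implement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type for this sketch: one 64-bit word. */
typedef uint64_t limb_t;

/* Illustrative reference only (not GMP's implementation):
   rp[0..n-1] = up[0..n-1] + vp[0..n-1], least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
limb_t ref_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s  = up[i] + vp[i];   /* may wrap around */
        limb_t c1 = s < up[i];       /* 1 if the limb add wrapped */
        limb_t r  = s + carry;       /* add the incoming carry */
        limb_t c2 = r < s;           /* 1 if adding the carry wrapped */
        rp[i] = r;
        carry = c1 | c2;             /* both cannot be 1 at once */
    }
    return carry;
}
```

In assembly the two compares disappear into a single add-with-carry
per limb, so the loop cost is dominated by the loads, the store and
the carry dependency.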

  > You might want to see how close to 4 cycles/limb you can get for a new
  > mpn_addmul_1 and friends.  (The mulq instruction cannot be repeated
  > more than once every 4th cycle, so mpn_addmul_1 will never run better
  > than at 4 cycles/limb using mulq.)

  I played with this some today, but the best I could get was around 6.5
  cycles/limb on the Woodcrest which is about the same as the old code
  on Woodcrest.  It didn't benefit at all from unrolling.

You might need to reorder and schedule instructions, try different
instruction combinations, and different memory addressing schemes.
This can be a tedious task.

The theoretical limit for addmul_1 for Opteron might be 2 cycles/limb,
although I am not aware of any loop that runs better than at 3
cycles/limb.  The theoretical limit for addmul_1 for Woodcrest is >= 4
cycles/limb, but I am not aware of any loop that runs better than at 5
cycles/limb.
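
As a reference for what addmul_1 has to do per limb, here is a small C
sketch, assuming 64-bit limbs and the unsigned __int128 extension of
GCC and Clang.  It is not GMP's code; limb_t and ref_addmul_1 are names
invented for this illustration.  Each iteration needs one 64x64->128
bit multiply, which is why the multiply throughput sets a floor on
cycles/limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type for this sketch: one 64-bit word. */
typedef uint64_t limb_t;

/* Illustrative reference only (not GMP's implementation):
   rp[0..n-1] += up[0..n-1] * v, least significant limb first.
   Returns the limb carried out of the top position.
   Assumes a compiler providing unsigned __int128 (GCC, Clang). */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow 128 bits:
           (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;            /* low 64 bits */
        carry = (limb_t)(t >> 64);    /* high 64 bits */
    }
    return carry;
}
```

The single wide expression corresponds to one multiply plus two
add-with-carry steps per limb in assembly.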

  Does the Core 2 64-bit multiply use the 64-bit adders from multiple
  ALU execution units?  From the descriptions I've seen, I expected it
  to be able to issue a multiply, two adds, and a load/store
  simultaneously.  However, that doesn't seem to be the case (but
  perhaps I'm just confused).

Never trust vendors' manuals.  Experimenting with the pipelines gives
more accurate data.

-- 
Torbjörn