I've ported GMP to Mac Pro. GMPbench > 7700
tege at swox.com
Mon Oct 16 17:32:04 CEST 2006
"Jason Martin" <jason.worth.martin at gmail.com> writes:
I played with my code some on an Opteron machine today and got it down
to 2.5 cycles/limb. There is a trade-off, of course. For low limb
counts (less than 16) the old code is still faster on the Opteron. On
a Woodcrest, however, my unrolled add_n beats the old code for all
limb counts (which is extremely surprising considering I save off so
many registers to the stack; it must be doing something clever to
avoid read-after-write stalls).
The theoretical limit for add_n and sub_n is 1.5 cycles/limb for
Opteron, and 2.0 cycles/limb on Woodcrest (or Conroe).
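
For readers who have not looked at the mpn layer, here is a minimal portable C sketch of the work an add_n loop does per limb. It is only a reference model, not the assembly discussed in this thread: the names limb_t and ref_add_n are hypothetical, and 64-bit limbs stored least significant first are assumed. The tuned x86-64 loops propagate the carry with adc rather than with the comparisons used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GMP's limb type on a 64-bit target. */
typedef uint64_t limb_t;

/* Reference model: add the n-limb numbers {ap,n} and {bp,n}, store the
   n low limbs of the sum at rp, and return the carry out (0 or 1).
   Limbs are stored least significant first. */
static limb_t ref_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t s = a + bp[i];   /* may wrap around modulo 2^64 */
        limb_t c1 = s < a;      /* 1 if the first addition wrapped */
        limb_t r = s + carry;   /* add the incoming carry */
        limb_t c2 = r < s;      /* 1 if the second addition wrapped */

        rp[i] = r;
        carry = c1 | c2;        /* the two wraps cannot both occur */
    }
    return carry;
}
```

Each iteration is two loads, one store and an add with carry, which is why the cycles/limb figures above are governed by load/store and carry-chain throughput rather than by arithmetic.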
> You might want to see how close to 4 cycles/limb you can get for a new
> mpn_addmul_1 and friends. (The mulq instruction cannot be repeated
> more than once every 4 cycles, so mpn_addmul_1 will never run better
> than at 4 cycles/limb using mulq.)
I played with this some today, but the best I could get was around 6.5
cycles/limb on the Woodcrest which is about the same as the old code
on Woodcrest. It didn't benefit at all from unrolling.
You might need to reorder and schedule instructions, try different
instruction combinations, and different memory addressing schemes.
This can be a tedious task.
The theoretical limit for addmul_1 for Opteron might be 2 cycles/limb,
although I am not aware of any loop that runs better than at 3
cycles/limb. The theoretical limit for addmul_1 for Woodcrest is >= 4
cycles/limb, but I am not aware of any loop that runs better than at 5
cycles/limb.
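
As a rough illustration of what an addmul_1 loop has to compute, here is a portable C sketch. It is a reference model under stated assumptions, not the assembly loop being tuned: limb_t and ref_addmul_1 are hypothetical names, limbs are assumed to be 64 bits, and the unsigned __int128 type is a GCC/Clang extension used here in place of the mulq instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GMP's limb type on a 64-bit target. */
typedef uint64_t limb_t;

/* Reference model: multiply {ap,n} by the single limb b, add the product
   to {rp,n} in place, and return the most significant (carry-out) limb.
   Limbs are stored least significant first. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    limb_t carry = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128 bit product plus two 64-bit addends.  The sum is at
           most (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so it cannot overflow. */
        unsigned __int128 t = (unsigned __int128)ap[i] * b + rp[i] + carry;

        rp[i] = (limb_t)t;           /* low limb goes back to memory */
        carry = (limb_t)(t >> 64);   /* high limb feeds the next iteration */
    }
    return carry;
}
```

There is exactly one widening multiply per limb, so a loop built on mulq cannot beat the rate at which mulq itself can be issued. The two additions and the load/store traffic have to be scheduled around that multiply.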
Does the Core 2 64 bit multiply use the 64 bit adders from multiple
ALU execution units? From the descriptions I've seen, I expected it
to be able to issue a multiply, two adds, and a load/store
simultaneously. However, that doesn't seem to be the case (but
perhaps I'm just confused).
Never trust vendors' manuals. Experimenting with the pipelines gives
more accurate data.