I've ported GMP to Mac Pro. GMPbench > 7700
Jason Martin
jason.worth.martin at gmail.com
Mon Oct 16 05:39:35 CEST 2006
On 15 Oct 2006 14:58:16 +0200, Torbjorn Granlund <tege at swox.com> wrote:
> "Jason Martin" <jason.worth.martin at gmail.com> writes:
>
> After everything is in cache and the limb count is high enough, I'm
> getting 3 clock cycles/limb on Woodcrest and 3.5 clock cycles/limb on
> Conroe. Note, however, that to test my code out on my Linux Conroe
> box, I had to replace the lahf and sahf instructions with bt and setc
> which seem to be a little slower (at least Agner Fog says so). I've
> attached my testing code and timing routines so you can see exactly
> what I'm doing.
>
> To save the carry flag, use sbb or setc.
> To restore it, use a plain add.
>
> 3 cycles/limb is much better than the present 13 or so cycles. But it
> is possible to reach 2 cycles/limb with unrolling and jrcxz.
I tried various permutations of these ideas today and saw no change in
performance. I also still see a definite difference in performance
(in terms of cycles/limb) between my Woodcrest and Conroe chips, even
though they are theoretically the same core.
I played with my code some on an Opteron machine today and got it down
to 2.5 cycles/limb. There is a trade-off, of course. For low limb
counts (less than 16) the old code is still faster on the Opteron. On
a Woodcrest, however, my unrolled add_n beats the old code for all
limb counts. That is extremely surprising considering how many
registers I save off to the stack; the chip must be doing something
clever to avoid read-after-write stalls.
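For reference, here is a minimal C sketch of what an add_n routine
computes. It is not the assembly being timed in this thread, only the
carry-propagating loop that the assembly implements. The names limb_t
and ref_add_n are made up for illustration (limb_t stands in for GMP's
mp_limb_t), and 64-bit limbs are assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for mp_limb_t, 64-bit assumed */

/* Add the n-limb numbers a and b (least significant limb first),
   store the n-limb sum in r, and return the carry out (0 or 1). */
static limb_t ref_add_n(limb_t *r, const limb_t *a, const limb_t *b, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s  = a[i] + b[i];   /* may wrap around modulo 2^64 */
        limb_t c1 = s < a[i];      /* 1 if the first addition wrapped */
        limb_t t  = s + carry;     /* add the carry from the previous limb */
        limb_t c2 = t < s;         /* 1 if adding the carry wrapped */
        r[i] = t;
        carry = c1 | c2;           /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

In assembly the two compares disappear, because adc consumes and
produces the carry flag directly. That is why keeping the flag alive
across the loop-control instructions matters so much for cycles/limb.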
> You might want to see how close to 4 cycles/limb you can get for a new
> mpn_addmul_1 and friends. (The mulq instruction cannot be repeated
> more than once every 4th cycles, so mpn_addmul_1 will never run better
> than at 4 cycles/limb using mulq.)
I played with this some today, but the best I could get was around 6.5
cycles/limb on the Woodcrest, which is about the same as the old code
on Woodcrest. It didn't benefit at all from unrolling.
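To make the operation count concrete, here is a minimal C sketch of
what an addmul_1 routine computes: r += a * v, returning the carry
limb. It is a reference loop, not the assembly under discussion. The
names limb_t and ref_addmul_1 are made up for illustration, 64-bit
limbs are assumed, and unsigned __int128 is a GCC/Clang extension
rather than standard C.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for mp_limb_t, 64-bit assumed */

/* Multiply the n-limb number a by the single limb v, add the product
   into the n-limb number r, and return the carry-out limb. */
static limb_t ref_addmul_1(limb_t *r, const limb_t *a, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One 64x64 -> 128-bit multiply per limb. The sum cannot
           overflow 128 bits: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 p = (unsigned __int128)a[i] * v + r[i] + carry;
        r[i]  = (limb_t)p;          /* low half goes back to r */
        carry = (limb_t)(p >> 64);  /* high half carries into the next limb */
    }
    return carry;
}
```

Each iteration needs exactly one widening multiply, so if that
multiply can only issue once every 4 cycles, the loop cannot go below
4 cycles/limb no matter how the additions are scheduled.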
Does the Core 2 64-bit multiply use the 64-bit adders from multiple
ALU execution units? From the descriptions I've seen, I expected it
to be able to issue a multiply, two adds, and a load/store
simultaneously. However, that doesn't seem to be the case (but
perhaps I'm just confused).
--jason
--
"Ever my heart rises as we draw near the mountains.
There is good rock here." -- Gimli, son of Gloin