I've ported GMP to Mac Pro. GMPbench > 7700

Jason Martin jason.worth.martin at gmail.com
Mon Oct 16 05:39:35 CEST 2006


On 15 Oct 2006 14:58:16 +0200, Torbjorn Granlund <tege at swox.com> wrote:
> "Jason Martin" <jason.worth.martin at gmail.com> writes:
>
>   After everything is in cache and the limb count is high enough, I'm
>   getting 3 clock cycles/limb on Woodcrest and 3.5 clock cycles/limb on
>   Conroe.  Note, however, that to test my code out on my Linux Conroe
>   box, I had to replace the lahf and sahf instructions with bt and setc
>   which seem to be a little slower (at least Agner Fog says so).  I've
>   attached my testing code and timing routines so you can see exactly
>   what I'm doing.
>
> To save the carry flag, use sbb or setc.
> To restore it, use a plain add.
>
> 3 cycles/limb is much better than the present 13 or so cycles.  But it
> is possible to reach 2 cycles/limb with unrolling and jrcxz.

I tried various permutations of these ideas today and saw no change in
performance.  I also still see a definite difference in performance
(in terms of cycles/limb) between my Woodcrest and Conroe chips even
though they are theoretically the same core.

I played with my code some on an Opteron machine today and got it down
to 2.5 cycles/limb.  There is a trade-off, of course.  For low limb
counts (less than 16) the old code is still faster on the Opteron.  On
a Woodcrest, however, my unrolled add_n beats the old code for all
limb counts (which is surprising given how many registers I save
to the stack; the chip must be doing something clever to avoid
read-after-write stalls).

> You might want to see how close to 4 cycles/limb you can get for a new
> mpn_addmul_1 and friends.  (The mulq instruction cannot be repeated
> more than once every 4th cycle, so mpn_addmul_1 will never run better
> than at 4 cycles/limb using mulq.)

I played with this some today, but the best I could get was around 6.5
cycles/limb on the Woodcrest, which is about the same as the old code.
It didn't benefit at all from unrolling.

Does the Core 2's 64-bit multiply use the 64-bit adders from multiple
ALU execution units?  From the descriptions I've seen, I expected it
to be able to issue a multiply, two adds, and a load/store
simultaneously.  However, that doesn't seem to be the case (but
perhaps I'm just confused).

--jason

-- 
"Ever my heart rises as we draw near the mountains.
There is good rock here." -- Gimli, son of Gloin

