Improved mpn code for Core 2

Sat Dec 2 02:51:28 CET 2006

Hi All,

I've managed to improve the addmul_1 (and friends) mpn routines for
Core 2 processors.

My addmul_1 executes at 4.6 cycles/limb with a 4-way unroll of the
main loop and 4.3 cycles/limb with a 16-way unroll.  I believe that
this is close to optimal for the Core 2 architecture.  submul_1
behaves identically to addmul_1, and mul_1 executes at 4 cycles/limb.
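
For anyone unfamiliar with the routine: addmul_1 multiplies an n-limb
operand by a single limb, adds the product into the destination, and
returns the carry-out limb.  Below is a minimal C sketch of that
operation, assuming 64-bit limbs and the GCC/Clang unsigned __int128
extension.  The name ref_addmul_1 is made up for illustration; this is
a plain reference loop, not the assembly in the patch, and it will run
far slower than the figures quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (illustration only, not the patch code).
   Computes {rp, n} += {up, n} * v and returns the carry-out limb.
   Assumes 64-bit limbs and the GCC/Clang unsigned __int128 extension. */
uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;

    assert(n >= 1);  /* the mpn routines expect at least one limb */

    for (size_t i = 0; i < n; i++) {
        /* up[i] * v + rp[i] + carry is at most 2^128 - 1, so the
           128-bit accumulator cannot overflow. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (uint64_t)t;          /* low limb goes back to the destination */
        carry = (uint64_t)(t >> 64);  /* high limb carries into the next step */
    }
    return carry;
}
```

Each iteration needs one widening multiply and a carry chain through
two additions, and that per-limb work is what the cycles/limb numbers
measure.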

This, together with some earlier changes to add_n and sub_n, gives a
GMPbench score of 8260 on my 2.66GHz Mac Pro, so it appears to make
quite a difference for the Core 2 architecture.

If you're interested, the code is available on my homepage:

    http://www.math.jmu.edu/~martin

For those who asked:  I've included an install routine that detects
the CPU and will only install the patches if a Core 2 CPU is found.
Hopefully this will allow you to add the patches into whatever
automatic build scripts you are using.
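
If you would rather do the check inside your own build scripts, here
is a rough C sketch of one way to recognize a Core 2 from CPUID.  It
is not the install routine shipped with the patches.  It assumes GCC
or Clang on x86 (for the <cpuid.h> helper __get_cpuid), and it assumes
that "Core 2" means an Intel family 6 part with display model 15 (the
65nm chips) or 23 (the later 45nm chips).  All function names are
hypothetical.

```c
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid */
#endif

/* Display family from CPUID leaf 1 EAX: the extended family field is
   added only when the base family field is 0xF. */
unsigned cpu_family(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;
    return family;
}

/* Display model from CPUID leaf 1 EAX: the extended model field supplies
   the high nibble when the base family field is 6 or 0xF. */
unsigned cpu_model(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    unsigned model = (eax >> 4) & 0xF;
    if (family == 0x6 || family == 0xF)
        model += ((eax >> 16) & 0xF) << 4;
    return model;
}

/* Assumption: family 6 with model 15 or 23 is treated as Core 2. */
int signature_is_core2(uint32_t eax)
{
    unsigned model = cpu_model(eax);
    return cpu_family(eax) == 6 && (model == 15 || model == 23);
}

/* Returns 1 if the host looks like an Intel Core 2, otherwise 0.
   Always returns 0 when not compiled for x86. */
int host_is_core2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    char vendor[13];

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX order. */
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    if (strcmp(vendor, "GenuineIntel") != 0)
        return 0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return signature_is_core2(eax);
#else
    return 0;
#endif
}
```

A build script could wrap host_is_core2() in a tiny program and use
its exit status to decide whether to apply the patches.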

--jason

-----------------------------------------------------------
Jason Worth Martin
Asst. Prof. of Mathematics
James Madison University
http://www.math.jmu.edu/~martin

"Ever my heart rises as we draw near the mountains.
There is good rock here." -- Gimli, son of Gloin