GMP performance on new processors

Sun Apr 10 21:04:16 CEST 2011

AMD's processors have for a decade or more been the choice for bignum
crunching, at least as long as one does not depend on floating-point
based FFTs.

The main reason for AMD's excellence in this area is that they have
implemented the MUL word x word -> doubleword instruction well, i.e.,
with low latency and with high repeat rate.
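
To make "word x word -> doubleword" concrete, here is a minimal C sketch
of that operation for 64-bit words.  The names dword_t and mul_word are
made up for this illustration and are not GMP functions, and the code
assumes the unsigned __int128 extension of GCC and Clang.  On x86-64
these compilers typically turn the multiplication into a single MUL
instruction.

```c
#include <stdint.h>

/* Hypothetical type for this sketch: a doubleword as two 64-bit words. */
typedef struct { uint64_t hi, lo; } dword_t;

/* Hypothetical helper, not a GMP function: multiply two 64-bit words and
   return the full 128-bit product.  Assumes the unsigned __int128
   extension provided by GCC and Clang on 64-bit targets. */
dword_t
mul_word (uint64_t a, uint64_t b)
{
  unsigned __int128 p = (unsigned __int128) a * b;
  dword_t r;
  r.hi = (uint64_t) (p >> 64);  /* upper word of the product */
  r.lo = (uint64_t) p;          /* lower word of the product */
  return r;
}
```

Bignum multiplication performs one such operation per pair of words, so
both the latency and the repeat rate of MUL show up directly in the
running time.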

Intel has not given the same attention to the MUL instruction;
furthermore, their ADC (add with carry) and SBB (subtract with borrow)
have needed at least 2 cycles when put in a dependent chain.
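
The dependent chain is easiest to see in multi-word addition.  The
sketch below is written in the spirit of GMP's mpn_add_n but is not its
actual code; the name add_n and the use of uint64_t for words are
assumptions for this illustration.  Each iteration needs the carry from
the previous one, so when the loop is written with ADC, the ADC latency
sets a floor on the cycles spent per word.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP code: add the n-word numbers {ap,n} and
   {bp,n}, least significant word first, store the n-word sum at rp and
   return the carry out (0 or 1). */
uint64_t
add_n (uint64_t *rp, const uint64_t *ap, const uint64_t *bp, size_t n)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t s = ap[i] + cy;   /* add the incoming carry */
      uint64_t c1 = s < cy;      /* set if that addition wrapped around */
      uint64_t t = s + bp[i];
      uint64_t c2 = t < s;       /* set if the second addition wrapped */
      rp[i] = t;
      cy = c1 | c2;              /* at most one of c1, c2 can be set */
    }
  return cy;
}
```

In hand-written assembly the carry lives in the carry flag and each step
is one ADC; with a 2-cycle ADC latency such a loop cannot run faster
than 2 cycles per word, however wide the processor is.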

The latest Sandy Bridge processors finally have a better MUL, and
ADC/SBB have also been improved to solve many of the 2-cycle latency
situations.  The previous generation Nehalem/Westmere processors were
also an improvement over their predecessors; there is a gradual
improvement being implemented.

AMD, on the other hand, has had good MUL and ADC/SBB support since the
K7 days.  All their 64-bit processors (K8-K10) have excellent MUL with
latency of 4-5 cycles and a repeat rate of one insn every 2nd cycle.

Now it seems AMD is taking the opposite path with its latest two
processors, the low-end Bobcat and the high-end Bulldozer.

We have a Bobcat system already in the GMP test array; its MUL can be
repeated every 5th cycle.  This is not bad for a low-power processor
(Intel's Atom can repeat the insn every 18th cycle...).

What is worse--for GMP--is that Bulldozer also seems to come with a
low-performance MUL; it has a long latency of 6 cycles and a repeat rate
of just one every 4th cycle.  This will translate to a major drop in GMP
performance on the new processor, when compared to its performance on
existing high-end AMD K10 processors.

http://support.amd.com/us/Processor_TechDocs/47414.pdf
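
As a rough back-of-the-envelope check of what these repeat rates mean,
the sketch below computes a lower bound on the cycles needed for a
schoolbook multiplication of two n-word numbers, which takes n*n MUL
instructions.  It assumes MUL issue rate is the only bottleneck and
ignores clock frequency and all other instructions; the function name
is made up for this illustration, and the rates are the ones quoted
above.

```c
#include <stdint.h>

/* MUL repeat rates in cycles per instruction, as quoted in the text. */
enum
{
  MUL_RATE_K10 = 2,
  MUL_RATE_BULLDOZER = 4,
  MUL_RATE_BOBCAT = 5,
  MUL_RATE_ATOM = 18
};

/* Hypothetical helper: lower bound on cycles for an n x n word schoolbook
   multiplication, assuming only the MUL repeat rate limits throughput. */
uint64_t
basecase_mul_min_cycles (uint64_t n, uint64_t mul_rate)
{
  return n * n * mul_rate;  /* n*n word products, mul_rate cycles each */
}
```

For 10-word operands this gives at least 200 cycles on K10 and at least
400 on Bulldozer, i.e., a factor of two per clock cycle under these
assumptions.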

-- 
Torbjörn