GMP performance on new processors
Torbjorn Granlund
tg at gmplib.org
Sun Apr 10 21:04:16 CEST 2011
AMD's processors have for a decade or more been the choice for bignum
crunching, at least as long as one does not depend on floating-point
based FFTs.
The main reason for AMD's excellence in this area is that they have
implemented the MUL word x word -> doubleword instruction well, i.e.,
with low latency and with high repeat rate.
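To make the word x word -> doubleword operation concrete, here is a
minimal C sketch of it and of the kind of inner loop that depends on it.
The names limb_t, umul_ppmm_sketch and mul_1_sketch are hypothetical and
are not GMP's own code; the sketch assumes a 64-bit limb and a compiler
that provides unsigned __int128 (a GCC/Clang extension on 64-bit
targets). On x86-64 such compilers typically turn the widening multiply
into a single MUL, but that is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one limb is one 64-bit machine word. */
typedef uint64_t limb_t;

/* Full 64 x 64 -> 128 bit product, split into a high and a low limb.
   Uses unsigned __int128, which is a compiler extension. */
static void umul_ppmm_sketch(limb_t *hi, limb_t *lo, limb_t a, limb_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (limb_t)p;
    *hi = (limb_t)(p >> 64);
}

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb.
   Exactly one widening multiply is issued per limb. */
static limb_t mul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n == 0 || (rp != NULL && up != NULL));
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t hi, lo;
        umul_ppmm_sketch(&hi, &lo, up[i], v);
        lo += carry;
        hi += (lo < carry);  /* cannot overflow: hi is at most 2^64 - 2 */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Since the loop issues one MUL per limb, the repeat rate of MUL is a
floor on its cycles per limb, whatever else the processor does well.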
Intel has not given the same attention to the MUL instruction;
furthermore, its ADC (add with carry) and SBB (subtract with borrow)
have needed at least 2 cycles when put in a dependent chain.
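The dependent chain in question is easiest to see in multi-limb
addition. The following is a rough C sketch, not GMP's code; limb_t and
add_n_sketch are hypothetical names, and a 64-bit limb is assumed. Each
iteration consumes the carry produced by the previous one, which is the
dependency that ADC expresses directly in assembly. Whether a compiler
emits ADC for this C code is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one limb is one 64-bit machine word. */
typedef uint64_t limb_t;

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; returns the carry-out (0 or 1). */
static limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp,
                           size_t n)
{
    assert(n == 0 || (rp != NULL && up != NULL && vp != NULL));
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s = up[i] + vp[i];
        limb_t c1 = s < up[i];   /* carry out of up[i] + vp[i] */
        limb_t r = s + carry;    /* depends on the previous iteration */
        limb_t c2 = r < s;       /* carry out of adding the carry-in */
        rp[i] = r;
        carry = c1 | c2;         /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

Because the carry threads through every iteration, the latency of ADC
bounds a loop built on it: at 2 cycles per dependent ADC, such a loop
cannot go faster than 2 cycles per limb.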
The latest Sandy Bridge processors finally have a better MUL, and
ADC/SBB have also been improved to solve many of the 2 cycle latency
situations. The previous generation Nehalem/Westmere processors were
also an improvement over their predecessors; the improvement has been
gradual.
AMD, on the other hand, has had good MUL and ADC/SBB support since the
K7 days. All their 64-bit processors (K8-K10) have excellent MUL with
a latency of 4-5 cycles and a repeat rate of one insn every 2nd cycle.
Now it seems AMD is taking the opposite path with its latest two
processors, the low-end Bobcat and the high-end Bulldozer.
We have a Bobcat system already in the GMP test array; its MUL can be
repeated every 5th cycle. This is not bad for a low-power processor
(Intel's Atom can repeat the insn every 18th cycle...).
What is worse--for GMP--is that Bulldozer also seems to come with a low
performance MUL; it has a long latency of 6 cycles and a repeat rate of
just one every 4th cycle. This will translate to a major drop in GMP
performance on the new processor, when compared to its performance on
existing high-end AMD K10 processors.
http://support.amd.com/us/Processor_TechDocs/47414.pdf
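
As a back-of-the-envelope illustration of what the repeat rates quoted
above mean, the sketch below computes a lower bound on the cycles needed
for a schoolbook n x n limb multiplication, which needs n*n limb
products. The function and constant names are hypothetical; the figures
are the ones given in this message. The bound counts only MUL issue
slots and ignores everything else, so real code can only be slower.

```c
#include <assert.h>
#include <stddef.h>

/* MUL repeat rates in cycles per instruction, as quoted in this message. */
enum {
    K10_MUL_REPEAT       = 2,
    BULLDOZER_MUL_REPEAT = 4,
    BOBCAT_MUL_REPEAT    = 5,
    ATOM_MUL_REPEAT      = 18
};

/* Lower bound on cycles for a schoolbook n x n limb multiply:
   n*n MUL instructions, one issued every repeat_cycles cycles. */
static unsigned long long
basecase_mul_cycles_lower_bound(size_t n, unsigned repeat_cycles)
{
    assert(repeat_cycles > 0);
    return (unsigned long long)n * n * repeat_cycles;
}
```

For example, with n = 10 the bound is 200 cycles at the K10 rate and 400
cycles at the Bulldozer rate, i.e., a factor of 2 from MUL throughput
alone.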
--
Torbjörn