AMD bulldozer and GMP

Torbjorn Granlund tg at gmplib.org
Tue Feb 14 22:28:14 CET 2012


I've had the opportunity to measure GMP's performance on AMD's new
high-end processor line, AMD FX, a.k.a. Bulldozer.

One should keep in mind that AMD has been in the lead in integer number
crunching since the original athlon (K7) came out, thanks to its
superior handling of integer multiplication and add-with-carry, and
its 3-way integer issue (compared to 2-3 for Intel, until Sandybridge,
which is fully 3-way).

The GMPbench result for Bulldozer is 36% worse than K10,
clock-for-clock.  To match this poor clock-for-clock performance, one
needs to look at very low-power processors like AMD bobcat and VIA nano,
or the 15-year-old Alpha ev6.

See: http://gmplib.org/gmpbench.html
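
As a rough illustration of what a clock-for-clock deficit means, the
sketch below computes the clock at which Bulldozer would break even
with K10.  It assumes that "36% worse" means a per-clock score of 0.64
times K10's and that the score scales linearly with clock; the K10
clock used is a hypothetical value, not a measured configuration.

```python
# Assumption: "36% worse" is read as a per-clock score of 0.64x K10's.
BD_PER_CLOCK_VS_K10 = 1.0 - 0.36


def clock_to_match(k10_ghz, ratio=BD_PER_CLOCK_VS_K10):
    """Clock (GHz) at which Bulldozer would equal a K10 at k10_ghz.

    Assumes the benchmark score scales linearly with clock frequency.
    """
    return k10_ghz / ratio


if __name__ == "__main__":
    k10_ghz = 3.0  # hypothetical K10 clock, for illustration only
    needed = clock_to_match(k10_ghz)
    print(f"Bulldozer needs about {needed:.2f} GHz to match K10 at {k10_ghz} GHz")
```

Under these assumptions the break-even clock is roughly 1.56 times
the K10 clock.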

I explored some pipeline characteristics for an explanation.  (1) AMD
now handles just 2 integer insns/cycle.  (2) Integer multiply is poorly
pipelined, with a throughput of 1 every 4th cycle and a latency of 6-7
cycles.  (K8/K10 had a throughput of 1 every 2nd cycle and a latency of
4-5 cycles.)

See: http://gmplib.org/~tege/x86-timing.pdf

(Only the first table has been updated for Bulldozer; its column is
"BD1".)
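
The multiply throughput figures quoted above put a floor under any
multiply-heavy loop.  A minimal sketch, assuming one hardware multiply
per limb product and ignoring every other instruction in the loop (the
function names and the operand size are illustrative, not taken from
GMP):

```python
# Cycles between successive independent integer multiplies, as quoted
# in the message (1 every 2nd cycle for K8/K10, 1 every 4th for Bulldozer).
MUL_ISSUE_INTERVAL = {"K8/K10": 2, "Bulldozer": 4}


def basecase_limb_products(n):
    """Schoolbook n x n limb multiplication performs n*n limb products."""
    return n * n


def mul_bound_cycles(limb_products, issue_interval):
    """Minimum cycles needed just to issue `limb_products` multiplies.

    Assumes one hardware multiply per limb product and nothing else
    limiting the loop, so this is a lower bound only.
    """
    return limb_products * issue_interval


if __name__ == "__main__":
    n = 10  # hypothetical operand size in limbs
    for cpu, interval in MUL_ISSUE_INTERVAL.items():
        cycles = mul_bound_cycles(basecase_limb_products(n), interval)
        print(f"{cpu}: at least {cycles} cycles for a {n}x{n}-limb product")
```

On this bound alone, a multiply-limited loop cannot run below 4
cycles per limb product on Bulldozer, against 2 on K8/K10.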

Timing numbers for GMP primitives are no better; they more or less
match bobcat.

See: http://gmplib.org/devel/asm.html

The GMP numbers will improve with time, to about the level of an old
Intel Core2 (i.e., two full tick-tock generations back).

It is totally incomprehensible what AMD is doing.  The new processor
runs hot and slow, and hardly outperforms a 5W processor for integer
number crunching.  OK, it does, thanks to a 2x clock and more cores.
But clock-for-clock they are equal.

They need to go back to the K10 line, replace its aging branch
predictors, replace the 2-way associative L1 data cache, and soup up its
prefetching logic.  This'll keep the powerful ALUs busy.  Then
implement the whole thing in current silicon technology, and they'll
have a great CPU again.  I suppose they'll need to implement the latest
few hundred SSE and AVX instructions too, which will mess up the floor
plan.

If you are considering a machine for integer number crunching, you
need to get one with K10, i.e., AMD Phenom or Opteron 61xx, or Intel
Sandybridge (socket 1155, socket 2011).  Alternatively, go for low power
and good performance per Watt, and get a bunch of VIA nano or AMD
bobcat.  The main drawback for the latter two is that they don't support
ECC memory (but that's a problem with most Intel platforms too).

-- 
Torbjörn

