Sandy Bridge and GMP

Sat Jan 29 23:45:28 CET 2011

I had the chance to play with GMP on Intel's new processor family, Sandy
Bridge.  The results are quite interesting.

As most of you probably know, AMD processors have always run GMP better
than Intel processors.  The main reason for this is that the integer
multiply instruction on AMD's processors has a latency of 4-5 cycles
and a throughput of one insn every 2nd cycle.  Intel's processors have
had varying latency; 64-bit P6 processors have needed 8 cycles (Conroe)
or 10 cycles (Nehalem, Westmere) for the upper product half.  While
recent processors (Nehalem, Westmere) had a throughput of one insn every
2nd cycle, the long latency limited their GMP performance.

Another reason for AMD's edge is add-with-carry and subtract-with-borrow
latency.  All P6 family processors have a 2 cycle latency, while all AMD
processors have a 1 cycle latency.
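
To make the two latencies concrete, here is a rough C sketch of the kind
of loop that sits under GMP's multiplication: multiply a limb vector by
a single limb.  This is not GMP's code (the real x86-64 loops are
hand-written assembly); the name mul_1_sketch is made up for
illustration, and the sketch assumes a compiler that provides
unsigned __int128 (GCC or Clang on a 64-bit target).  Each limb needs
the upper product half from the multiply and then a carry-propagating
add, which are the two operations discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rp[0..n-1] = up[0..n-1] * v, returning the limb
   that falls off the top.  Same job as mpn_mul_1, but only a sketch. */
static uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128 bit multiply; on x86-64 this is one mul insn. */
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        uint64_t lo = (uint64_t)p;
        uint64_t hi = (uint64_t)(p >> 64);   /* upper product half */

        lo += carry;                /* add the carry limb from below */
        hi += (lo < carry);         /* propagate the carry-out; hi cannot
                                       overflow, it is at most 2^64-2 */
        rp[i] = lo;
        carry = hi;                 /* loop-carried dependency */
    }
    return carry;
}
```

Called with up = {2^64-1, 2^64-1} and v = 2, this stores
{2^64-2, 2^64-1} in rp and returns 1.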

But Sandy Bridge is different.  Its integer multiply unit is one cycle
faster than the units in AMD's processors, with a latency of just 3-4
cycles, and furthermore the throughput is one insn each cycle.  The
add-with-carry latency is somewhat puzzling; it seems to be epsilon less
than 2 cycles, but my measurements might be messed up by the auto
overclocking features ("turbo mode").
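
For readers unfamiliar with the latency/throughput distinction, the
following C sketch shows one common way to separate the two.  It is not
necessarily how the numbers in this post were obtained.  All function
names are made up, and unsigned __int128 is assumed to be available
(GCC or Clang, 64-bit target).  Timing the first loop per iteration
approximates multiply latency, since every multiply waits for the
previous one; timing the second per multiply approximates throughput,
since the four chains are independent.  The timing harness itself is
left out, and a compiler may reorder or unroll these loops, so the
generated code should be checked before trusting any number.

```c
#include <stdint.h>

/* Upper half of a 64x64 -> 128 bit product. */
static inline uint64_t mulhi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Latency-bound: each multiply depends on the previous result. */
static uint64_t chain_dependent(uint64_t x, uint64_t m, long n)
{
    for (long i = 0; i < n; i++)
        x = mulhi(x, m);
    return x;
}

/* Throughput-bound: four chains with no dependencies between them. */
static uint64_t chains_independent(uint64_t x, uint64_t m, long n)
{
    uint64_t a = x, b = x + 1, c = x + 2, d = x + 3;
    for (long i = 0; i < n; i++) {
        a = mulhi(a, m);
        b = mulhi(b, m);
        c = mulhi(c, m);
        d = mulhi(d, m);
    }
    return a ^ b ^ c ^ d;   /* combine so no chain is optimized away */
}
```

If the clock frequency changes during the run, as with turbo mode, a
timer that counts at a fixed rate will not report true core cycles,
which is one way a figure like "epsilon less than 2" could arise.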

The Sandy Bridge pipeline is not even similar to Nehalem's pipeline or
the pipeline of any other Intel processor.  Not everything has been
improved; the integer shift instructions have longer latency and worse
throughput, which hurts a few GMP operations.

AMD still rules, but Intel is getting closer.  A few multiply-dependent
GMP operations actually run faster on the new Intel processor than on
AMD's current processors.

Initial GMP mpn measurements:
    http://gmplib.org/devel/asm.html

More cycle numbers can be found here:
    http://gmplib.org/~tege/x86-timing.pdf

(Sandy Bridge is "SBR" here, Nehalem is "NHM".  Both are sold as Core
i3, i5, and i7.)

-- 
Torbjörn