Sandybridge addmul_N challenge
tg at gmplib.org
Mon Feb 20 12:12:22 CET 2012
The two high-end architectures for GMP are AMD K8-K10 (i.e., all Opteron
except 62xx, Athlon 64, Athlon X2, Phenom, Phenom II) and Intel
Sandybridge (i.e., socket 1155 and 2011 Core i3, i5, i7).
We have great multiplication loops for K8-K10: addmul_1 runs at 2.5 c/l
and addmul_2 runs at 2.375 c/l. (These loops are then used in
mul_basecase, sqr_basecase, redc_1, redc_2, and a few other places.)
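
For readers who do not have the operation in their head, here is a
minimal C sketch of what addmul_1 computes, following the documented
behaviour of mpn_addmul_1: multiply an n-limb operand by a single limb,
add the low n limbs of the product into the destination, and return the
carry-out limb. The names ref_addmul_1 and limb_t are made up for this
illustration, a 64-bit limb is assumed, and unsigned __int128 is a
GCC/Clang extension. It shows the semantics only; it is not GMP's code
and says nothing about how the assembly loops are scheduled.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs, as on x86-64 */

/* Reference semantics only (hypothetical name, not GMP's code):
   {rp,n} += {up,n} * v, returning the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One MUL plus the carry-propagating adds, done in 128 bits.
           The sum cannot overflow:
           (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;            /* low limb goes back to memory */
        carry = (limb_t)(t >> 64);    /* high limb feeds the next step */
    }
    return carry;
}
```

Each iteration corresponds to one limb of work, which is what the c/l
figures count.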
But our multiplication loops for Sandybridge are much worse. The
current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l. I
have new code running at 3.4 c/l and 3.3 c/l respectively.
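
To put the c/l (cycles per limb) figures in perspective, the following
C sketch turns a rate into a rough cycle count for one pass over n
limbs. The helper est_cycles is hypothetical, and the model is
deliberately crude: it assumes the cost is simply rate times limb count
and ignores loop entry/exit overhead and memory effects.

```c
#include <stddef.h>

/* Hypothetical helper: estimated cycles for one pass over n limbs at a
   given c/l rate. Assumes cost = rate * n, ignoring all overhead. */
static double est_cycles(double cycles_per_limb, size_t n)
{
    return cycles_per_limb * (double)n;
}
```

Under this model a 1000-limb addmul_1 costs 4000 cycles at 4 c/l and
2500 cycles at 2.5 c/l.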
The critical instructions for these loops are MUL and ADC. The
throughput of MUL is great for both CPUs (actually better on Sandybridge
than K8-K10). ADC is trickier. AMD can issue 3 per cycle with a
latency of 1 cycle, but for Intel the situation is more complicated:
In all cases the carry-out has a latency of 1 cycle. For "ADC $0,dreg"
the latency of dreg is one cycle, but for "ADC sreg,dreg" it is two
cycles. (It is also 2 cycles for "ADC $const,dreg" when const != 0.)
The challenge is to beat 3 c/l with either addmul_1 or addmul_2.
Success will boost GMP's general performance on these processors, since
every higher-level operation depends on these lowest-level multiply
loops.