Sandybridge addmul_N challenge

Mon Feb 20 12:12:22 CET 2012

The two high-end architectures for GMP are AMD K8-K10 (i.e., all Opteron
except 62xx, Athlon 64, Athlon X2, Phenom, Phenom II) and Intel
Sandybridge (i.e., socket 1155 and 2011 Core i3,i5,i7).

We have great multiplication loops for K8-K10: addmul_1 runs at 2.5 c/l
(cycles per limb) and addmul_2 runs at 2.375 c/l.  (These loops are then
used in mul_basecase, sqr_basecase, redc_1, redc_2, and a few other
places.)
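
For readers who do not have the semantics in their head: addmul_1
multiplies an n-limb operand by a single limb and adds the product into
the destination, returning the carry-out limb.  Below is a minimal C
sketch of that operation, not the GMP code.  The names limb_t and
ref_addmul_1 are made up for illustration, it assumes 64-bit limbs, and
it relies on the GCC/Clang extension unsigned __int128.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type for this sketch; assumes 64-bit limbs. */
typedef uint64_t limb_t;

/* Reference semantics only: rp[0..n-1] += up[0..n-1] * v.
   Returns the carry-out limb.  Relies on unsigned __int128,
   a GCC/Clang extension. */
static limb_t
ref_addmul_1 (limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
  limb_t carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* One 64x64->128 multiply per limb.  Adding rp[i] and the
         incoming carry cannot overflow 128 bits, since
         (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
      unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + carry;
      rp[i] = (limb_t) t;            /* low limb goes back to rp */
      carry = (limb_t) (t >> 64);    /* high limb feeds the next step */
    }
  return carry;
}
```

Each iteration needs one widening multiply plus the additions that
propagate the carry, which is why multiply and add-with-carry costs
determine the c/l figure.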

But our multiplication loops for Sandybridge are much worse.  The
current addmul_1 runs at 4 c/l and addmul_2 runs at about 3.4 c/l.  I
have new code running at 3.4 c/l and 3.3 c/l respectively.

The critical instructions for these loops are MUL and ADC.  The
throughput of MUL is great for both CPUs (actually better on Sandybridge
than K8-K10).  ADC is trickier; AMD can issue 3 per cycle with a
latency of 1 cycle, but for Intel the situation is more complicated:

In all cases the carry-out has a latency of 1 cycle.  For "ADC $0,dreg"
the latency of dreg is one cycle, but for "ADC sreg,dreg" it is two
cycles.  (It is also 2 cycles for "ADC $const,dreg" when const != 0.)

The challenge is to beat 3 c/l with either addmul_1 or addmul_2.

Success will boost GMP's general performance on these processors, since
every higher-level operation depends on these lowest-level multiply
primitives.

-- 
Torbjörn