GMP developers' ARM corner

ARM core pipeline overview

	A7	A8	A9	A15
issue width	1-2	2	2	3
issue order	in order	in order	limited out-of-order	out-of-order
Neon bits (most insn)	64	64	64	128
Neon bits/cycle (shifts imm count)	64	64	64	128
Neon bits/cycle (shifts reg count)	64	64	64	64

ARM optimisation motivation

Recent 32-bit ARM CPUs have great GMP performance potential, far better than any other 32-bit processor. Both the A9 and the A15 can sustain one 32 × 32 → 64 multiply every two cycles (0.5 per cycle) using the core instruction set. Furthermore, the A9 and the A15 can sustain one and two such multiplies per cycle, respectively, using the Neon extensions. The core and Neon multiply units are independent, meaning that the A9 can sustain 1.5 mulops/cycle and the A15 can sustain 2.5 mulops/cycle.

The current GMP code utilises the mulop bandwidth very poorly. The goal of this project is to utilise the hardware better, both the multiply hardware and the hardware for other critical operations.

ARM Cortex-A15 projects

The recent A15 progress on mpn_mul_1 and mpn_addmul_1 (see the mailing list) has made many asm functions obsolete: mpn_rshift, mpn_addlsh1, mpn_addlsh2, and mpn_cnd_add_n. It could also make mpn_lshift obsolete, and perhaps various sub/rsb functions as well.

Somewhat surprisingly, the Neon unit has better multiply throughput than shift throughput, perhaps making multiply-based mpn_lshift and mpn_rshift the optimal approach. An alternative is to use 64-bit shifting insns (allowing accurate destination sub-register) and 128-bit everything else.
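
To illustrate the multiply-based shift idea, here is a minimal C sketch, assuming 32-bit limbs. The name ref_lshift_by_mul is hypothetical and this is not the GMP implementation; it only shows that a left shift by cnt bits equals a multiplication by 2^cnt, with the high half of each 64-bit product carried into the next limb. A right shift can be handled analogously by multiplying by 2^(32-cnt).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference, not GMP code: shift {up,n} left by cnt bits
   (1 <= cnt <= 31) using one 32x32->64 multiply per limb instead of
   shift instructions. Returns the bits shifted out of the top limb. */
uint32_t ref_lshift_by_mul(uint32_t *rp, const uint32_t *up, size_t n,
                           unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 31);
    uint32_t m = (uint32_t)1 << cnt;   /* multiplier 2^cnt */
    uint32_t carry = 0;                /* high bits from the previous limb */
    for (size_t i = 0; i < n; i++) {
        /* Widening product, the kind of operation vmull.32 performs. */
        uint64_t p = (uint64_t)up[i] * m;
        /* The low cnt bits of p are zero and carry < 2^cnt, so OR is exact. */
        rp[i] = (uint32_t)p | carry;
        carry = (uint32_t)(p >> 32);
    }
    return carry;
}
```

The loop runs from the least significant limb upwards, so it works in place when rp and up are the same pointer.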

Even with a properly designed architecture like ARM/Neon, high-performance GMP code using Neon tends to be complicated, requiring very deep software pipelining. To avoid poor small operand performance, we need to use as shallow software pipelining as possible, and carefully design feed-in and wind-down code. If small operand performance is nevertheless worse than plain code, we need to provide special, well-optimised basecase code. Such basecase code is not an alternative to low-overhead Neon code, but might simplify the Neon code since it will not need to handle tiny operands.

TODO:

Finish Neon 1.35 c/l mpn_mul_1. See the gmp-devel archives. The sw pipeline first needs to be made shallower.
Finish Neon 1.65 c/l mpn_addmul_1. See the gmp-devel archives. The sw pipeline first needs to be made shallower.
Write a Neon mpn_submul_1, starting with the 1.65 c/l addmul_1, complementing U on-the-fly. Goal performance ≤ 1.82 c/l.
Rewrite mpn_lshift, mpn_rshift, mpn_lshiftc (currently 1.5 c/l). Using Neon shift (128-bit insns split by the hardware, or 64-bit insns) sets a lower bound of 1 c/l. Using vmull.32 or vmlal.32 sets a lower bound of 0.5 c/l.
Rewrite arm/v7a/cora15/neon/aorsorrlshC_n.asm (currently 2.5 c/l). For mpn_addlshC_n we may perhaps just fall back to mpn_addmul_1 (splitting the rp operand), but the subtracting variants are a lot trickier and will need a different scheme. (Perhaps fall back to a future fast mpn_submul_1).
Write a Neon mpn_addmul_2, similar to the new mpn_addmul_1. Performance goal: 1.3 c/l.
Write mpn_addmul_[k] for k ≥ 3 running at ≤ 1 c/l. Note that the vaddw.u32 scheme of mpn_addmul_1 and mpn_addmul_2 will not work, as the 64-bit accumulator would overflow.
Write mpn_mul_basecase using the fastest mpn_addmul_[k] using overlapped software pipelining.
Write mpn_sqr_basecase using mpn_addmul_2 using overlapped software pipelining.
Write mpn_redc_1, perhaps for just a few sizes, handling just n ≤ REDC_1_TO_REDC_2_THRESHOLD.
Write mpn_redc_2.
Write mpn_mod_1s_2p, mpn_mod_1s_3p, mpn_mod_1s_4p using Neon at least for the multiplies not on the critical path. This should get us to around 1 c/l.
Write mpn_mod_34lsub1 using Neon, for 0.6 c/l on A15 (and 1.0 c/l on A9). Perhaps also write a core mpn_mod_34lsub1, using ldrd (reaching 0.9 c/l on A15), or a hybrid core/Neon variant...
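
As background for the mpn_mod_34lsub1 item above, the following is a minimal C sketch of what the function computes with 32-bit limbs: a value congruent to the operand modulo 2^24 - 1, not necessarily fully reduced. The names ref_mod_34lsub1 and fold24 are hypothetical, and the code is a plain reference, not the tuned core or Neon variant discussed in the list. It relies on 2^32 ≡ 2^8 (mod 2^24 - 1), so limb weights cycle through 2^0, 2^8, 2^16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the bits above position 24 back in, using 2^24 == 1 mod 2^24 - 1. */
static uint64_t fold24(uint64_t x)
{
    return (x & 0xFFFFFFu) + (x >> 24);
}

/* Hypothetical reference, not GMP code: return a value congruent to {up,n}
   modulo 2^24 - 1. The result may equal or slightly exceed the modulus.
   Assumes n < 2^32 so that the 64-bit accumulators cannot overflow. */
uint32_t ref_mod_34lsub1(const uint32_t *up, size_t n)
{
    assert(n >= 1);
    uint64_t acc[3] = {0, 0, 0};   /* one sum per weight 2^0, 2^8, 2^16 */
    for (size_t i = 0; i < n; i++)
        acc[i % 3] += up[i];       /* 2^(32 i) == 2^(8 (i mod 3)) */

    /* Two folds bring each accumulator below 2^25 before weighting. */
    uint64_t a0 = fold24(fold24(acc[0]));
    uint64_t a1 = fold24(fold24(acc[1]));
    uint64_t a2 = fold24(fold24(acc[2]));
    uint64_t total = a0 + (a1 << 8) + (a2 << 16);
    return (uint32_t)fold24(fold24(total));
}
```

The three accumulators have no dependencies on each other inside the loop, which is one reason the operation looks like a reasonable fit for wide or vector additions.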

DONE:

Write a core insn mpn_submul_1 based on the 2.0 c/l mpn_addmul_1.
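
The TODO list mentions building mpn_submul_1 on an addmul_1 loop by complementing U on-the-fly. One plausible reading of that idea is sketched below in C, assuming 32-bit limbs; the name ref_submul_1_via_addmul is hypothetical and the code is a reference model, not the asm. With B = 2^32 and ~u = B^n - 1 - u, we have r - u·v = r + (~u)·v + v - v·B^n, so an addmul-style loop over the complemented limbs, seeded with an initial carry of v, produces the right low limbs, and the borrow is v minus the final carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference, not GMP code: {rp,n} -= {up,n} * v, returning the
   borrow limb, computed with an addmul-style loop on complemented U limbs. */
uint32_t ref_submul_1_via_addmul(uint32_t *rp, const uint32_t *up, size_t n,
                                 uint32_t v)
{
    assert(n >= 1);
    uint32_t carry = v;   /* the "+ v" term enters as the initial carry */
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so t cannot overflow. */
        uint64_t t = (uint64_t)(uint32_t)~up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    /* The final carry never exceeds v; the difference is the borrow. */
    return v - carry;
}
```

The comment in the loop also shows how tight the 64-bit headroom is: one 32 × 32 product plus two 32-bit addends fills the accumulator exactly, leaving no room for further terms.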