GMP developers' ARM corner
                                      A7        A8        A9        A15
issue width                           1-2       2         2         3
issue order                           in order  in order  limited   out-of-order
Neon bits (most insn)                 64        64        64        128
Neon bits/cycle (shifts imm count)    64        64        64        128
Neon bits/cycle (shifts reg count)    64        64        64        64
Recent ARM 32-bit CPUs have great GMP performance potential, far better than any other 32-bit processor. Both the A9 and A15 can sustain one 32 × 32 → 64 multiply every two cycles (1/2 per cycle) using the core instruction set. Furthermore, the A9 and A15 can sustain one and two such multiply operations per cycle, respectively, using the Neon extensions. The core and Neon multiply units are independent, meaning that the A9 can sustain 1.5 mulops/cycle and the A15 can sustain 2.5 mulops/cycle.
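For readers unfamiliar with the term, a "mulop" here is one 32 × 32 → 64 multiply, the inner step of functions like mpn_addmul_1. The following is a minimal portable C sketch of that inner loop, assuming 32-bit limbs; the name `ref_addmul_1` is hypothetical and this is not GMP's actual assembly code, only an illustration of the work done per limb.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reference routine (not GMP's implementation):
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up,
                             size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One mulop: 32 x 32 -> 64.  Adding two 32-bit values cannot
           overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;            /* low half is the result limb */
        carry = (uint32_t)(t >> 32);    /* high half feeds the next limb */
    }
    return carry;
}
```

At the rates quoted above, a loop needing one mulop per limb would be bounded by 2 cycles/limb on the core multiplier alone, which is why combining the core and Neon units is attractive.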
The current GMP code utilises the mulop bandwidth very poorly. The goal of this project is to utilise the hardware better, both the multiply hardware and the hardware for other critical operations.
The recent A15 progress on mpn_mul_1 and mpn_addmul_1 (see mailing list) has obsoleted many asm functions: mpn_rshift, mpn_addlsh1, mpn_addlsh2, and mpn_cnd_add_n. It could also obsolete mpn_lshift, and perhaps various sub/rsb functions as well.
Somewhat surprisingly, the Neon unit has better multiply throughput than shift throughput, perhaps making multiply-based mpn_lshift and mpn_rshift the optimal approach. An alternative is to use 64-bit shifting insns (allowing accurate destination sub-register) and 128-bit everything else.
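The idea of a multiply-based shift can be sketched in portable C: a left shift by cnt bits is a multiplication by 2^cnt, and a right shift by cnt bits is a multiplication by 2^(32-cnt) whose high and low product halves are recombined across adjacent limbs. The function names below are hypothetical, 32-bit limbs and 1 <= cnt <= 31 are assumed, and real Neon code would be structured quite differently; this only shows the arithmetic identity.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: rp = up << cnt, for n >= 1 and 1 <= cnt <= 31.
   Returns the bits shifted out of the top limb. */
static uint32_t lshift_via_mul(uint32_t *rp, const uint32_t *up,
                               size_t n, unsigned cnt)
{
    uint32_t v = (uint32_t)1 << cnt;     /* multiplier 2^cnt */
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* The low cnt bits of the product are zero and carry < 2^cnt,
           so this add behaves like a bitwise OR. */
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical sketch: rp = up >> cnt, for n >= 1 and 1 <= cnt <= 31.
   Returns the shifted-out bits, left-justified in the limb. */
static uint32_t rshift_via_mul(uint32_t *rp, const uint32_t *up,
                               size_t n, unsigned cnt)
{
    uint32_t v = (uint32_t)1 << (32 - cnt);  /* multiplier 2^(32-cnt) */
    uint64_t p = (uint64_t)up[0] * v;
    uint32_t out = (uint32_t)p;              /* bits falling off the bottom */
    uint32_t hi = (uint32_t)(p >> 32);       /* equals up[0] >> cnt */
    for (size_t i = 1; i < n; i++) {
        p = (uint64_t)up[i] * v;
        rp[i - 1] = hi | (uint32_t)p;        /* low half is up[i] << (32-cnt) */
        hi = (uint32_t)(p >> 32);
    }
    rp[n - 1] = hi;
    return out;
}
```

Note that this sketch walks the limbs in a fixed direction and makes no promise about overlapping source and destination operands.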
Even with a properly designed architecture like ARM/Neon, high-performance GMP code using Neon tends to be complicated, requiring very deep software pipelining. To avoid poor small operand performance, we need to use as shallow software pipelining as possible, and carefully design feed-in and wind-down code. If small operand performance is nevertheless worse than plain code, we need to provide special, well-optimised basecase code. Such basecase code is not an alternative to low-overhead Neon code, but might simplify the Neon code since it will not need to handle tiny operands.
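The structure described above can be illustrated with a small C sketch: a one-stage software-pipelined loop with explicit feed-in and wind-down code, plus a dispatcher that sends tiny operands to plain basecase code. Everything here is an assumption made for illustration: the function names, the 32-bit limb size, and the threshold value, which would have to be tuned by measurement. Real Neon code would pipeline much more deeply.

```c
#include <stdint.h>
#include <stddef.h>

enum { SMALL_N_THRESHOLD = 4 };  /* hypothetical cut-off, needs tuning */

/* Plain basecase: rp = up * v, returns carry-out limb.  Handles n == 0. */
static uint32_t mul_1_basecase(uint32_t *rp, const uint32_t *up,
                               size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* One-stage software pipeline; requires n >= 1.  The multiply for limb i
   is issued before the carry propagation for limb i-1 completes. */
static uint32_t mul_1_pipelined(uint32_t *rp, const uint32_t *up,
                                size_t n, uint32_t v)
{
    /* Feed-in: start the first multiply before entering the loop. */
    uint64_t p = (uint64_t)up[0] * v;
    uint32_t carry = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t next = (uint64_t)up[i] * v;  /* multiply for limb i */
        uint64_t t = p + carry;               /* finish limb i-1 */
        rp[i - 1] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
        p = next;
    }
    /* Wind-down: retire the product still in flight. */
    uint64_t t = p + carry;
    rp[n - 1] = (uint32_t)t;
    return (uint32_t)(t >> 32);
}

/* Dispatcher: tiny operands never reach the pipelined code. */
static uint32_t mul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    if (n < SMALL_N_THRESHOLD)
        return mul_1_basecase(rp, up, n, v);
    return mul_1_pipelined(rp, up, n, v);
}
```

Because the dispatcher guarantees n >= SMALL_N_THRESHOLD on the pipelined path, the feed-in code can assume at least one limb without checking, which is the kind of simplification the paragraph above refers to.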