GMP «Arithmetic without limitations» GMP developers' ARM corner



ARM core pipeline overview

                                     A7        A8        A9                    A15
issue width                          1-2       2         2                     3
issue order                          in order  in order  limited out-of-order  out-of-order
Neon bits (most insn)                64        64        64                    128
Neon bits/cycle (shifts imm count)   64        64        64                    128
Neon bits/cycle (shifts reg count)   64        64        64                    64

ARM optimisation motivation

Recent ARM 32-bit CPUs have great GMP performance potential, far better than that of any other 32-bit processor. Both the A9 and A15 can sustain one 32 × 32 → 64 multiply every two cycles (1/2 per cycle) using the core instruction set. Furthermore, using the Neon extensions, the A9 can sustain one such multiply operation per cycle and the A15 two. The core and Neon multiply units are independent, so the combined peak is 1/2 + 1 = 1.5 mulops/cycle on the A9 and 1/2 + 2 = 2.5 mulops/cycle on the A15.

The current GMP code utilises the mulop bandwidth very poorly. The goal of this project is to utilise the hardware better, both the multiply hardware and the hardware for other critical operations.

ARM Cortex-A15 projects

The recent A15 progress with respect to mpn_mul_1 and mpn_addmul_1 (see mailing list) has obsoleted many asm functions: mpn_rshift, mpn_addlsh1, mpn_addlsh2, and mpn_cnd_add_n. It could also obsolete mpn_lshift, and perhaps various sub/rsb functions as well.

Somewhat surprisingly, the Neon unit has better multiply throughput than shift throughput, perhaps making multiply-based mpn_lshift and mpn_rshift the optimal approach. An alternative is to use 64-bit shifting insns (allowing accurate destination sub-register) and 128-bit everything else.
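The multiply-based approach rests on a simple identity: shifting a limb left by cnt bits is the same as multiplying it by 2^cnt, where the low half of the 64-bit product is the shifted limb and the high half holds the bits that move into the next limb. The following is a minimal sketch of that idea in portable C, assuming 32-bit limbs stored least significant first. The name lshift_by_mul is hypothetical; this is not GMP's mpn_lshift and not the Neon code discussed above, only an illustration of why one mulop per limb is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP: shift the n-limb number at up[]
   left by cnt bits (1 <= cnt <= 31), store the low n limbs at rp[], and
   return the bits shifted out of the most significant limb.
   Limbs are 32 bits wide, least significant limb first. */
static uint32_t
lshift_by_mul (uint32_t *rp, const uint32_t *up, size_t n, unsigned cnt)
{
  assert (n >= 1 && cnt >= 1 && cnt <= 31);

  uint32_t m = (uint32_t) 1 << cnt;   /* multiplier 2^cnt */
  uint32_t carry = 0;                 /* bits coming in from the limb below */

  for (size_t i = 0; i < n; i++)
    {
      uint64_t p = (uint64_t) up[i] * m;   /* one 32 x 32 -> 64 mulop */
      /* The low cnt bits of the low half are zero and carry < 2^cnt,
         so OR is enough to merge them; no addition carry can occur. */
      rp[i] = (uint32_t) p | carry;
      carry = (uint32_t) (p >> 32);        /* high half goes to next limb */
    }
  return carry;
}
```

A right shift can be sketched the same way by multiplying by 2^(32-cnt) and swapping the roles of the two product halves: the high half is then the shifted limb and the low half holds the bits that move into the limb below.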

Even with a properly designed architecture like ARM/Neon, high-performance GMP code using Neon tends to be complicated, requiring very deep software pipelining. To avoid poor small-operand performance, we need to use as shallow software pipelining as possible, and carefully design the feed-in and wind-down code. If small-operand performance is nevertheless worse than that of plain code, we need to provide special, well-optimised basecase code. Such basecase code is not an alternative to low-overhead Neon code, but it might simplify the Neon code, since the latter will then not need to handle tiny operands.
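The structure described above (feed-in, pipelined loop, wind-down, plus a separate basecase for tiny operands) can be sketched in C. This is only an illustration under stated assumptions: 32-bit limbs, a pipeline depth of one, and an arbitrary size threshold of 4. The names mul_1_basecase and mul_1_pipelined are hypothetical, and real GMP code for this would be hand-scheduled assembly with a much deeper pipeline; a C compiler is also free to reschedule these operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical basecase: plain loop with no pipelining, for tiny operands.
   Computes {rp,n} = {up,n} * v and returns the most significant limb. */
static uint32_t
mul_1_basecase (uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
  uint32_t carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint64_t p = (uint64_t) up[i] * v + carry;
      rp[i] = (uint32_t) p;
      carry = (uint32_t) (p >> 32);
    }
  return carry;
}

/* Hypothetical one-stage software pipeline: the multiply for limb i+1 is
   started before the carry handling for limb i is finished. */
static uint32_t
mul_1_pipelined (uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
  assert (n >= 1);
  if (n < 4)                      /* illustrative threshold, not tuned */
    return mul_1_basecase (rp, up, n, v);

  uint32_t carry = 0;

  /* Feed-in: start the first multiply before entering the loop. */
  uint64_t p = (uint64_t) up[0] * v;

  for (size_t i = 0; i + 1 < n; i++)
    {
      uint64_t next = (uint64_t) up[i + 1] * v;  /* multiply for next limb */
      uint64_t s = p + carry;                    /* finish current limb */
      rp[i] = (uint32_t) s;
      carry = (uint32_t) (s >> 32);
      p = next;
    }

  /* Wind-down: finish the last product; no further multiply is started. */
  uint64_t s = p + carry;
  rp[n - 1] = (uint32_t) s;
  return (uint32_t) (s >> 32);
}
```

Because the pipelined function dispatches tiny sizes to the basecase, its feed-in code can assume at least a few limbs are present. That is the simplification the paragraph above refers to. A deeper pipeline would need a proportionally longer feed-in and wind-down, which is where the small-operand overhead comes from.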

TODO:

DONE:



Last modified: 2016-12-17