GMP «Arithmetic without limitations» GMP developers' SPARC corner



SPARC core pipeline overview

               US1 US2          US3 US4          T1-T2      T3         T4-T5
issue width    4 (2I,1LS,2FP)   4 (2I,1LS,2FP)   1          1          2
issue order    in order         in order         in order   in order   out-of-order

FP=floating point, LS=load/store, I=integer op

SPARC optimisation motivation

SPARC chips before the T4 under-perform on GMP. This is not because the GMP code is inadequately optimised for SPARC, but because of the basic v9 ISA and the micro-architecture of these chips. The T1 and T2 chips perform worse than any other SPARC chips; they are comparable to a 486 chip that is 15 years older.

The T4/T5 are completely different, and are not at all bad GMP performers; they are not much slower than a contemporary PC (using GMP repo code for SPARC). These CPUs are only 2-issue and can perform only one 64-bit ld/st per cycle, but they are out-of-order and have a fully pipelined integer multiply unit, albeit with an extreme latency of 12 cycles. Unlike older SPARCs, they (and the T3) have an instruction, umulxhi, for producing the upper half of a 64 × 64 multiply, and a 64-bit add-with-carry (but no corresponding subtract).
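To make the role of these two instructions concrete, here is a minimal portable C sketch of the operations they provide in hardware, and of how a limb-by-limb multiply loop uses them. The names umul_hi, add_with_carry and mul_1_sketch are hypothetical and chosen for illustration only; this is not GMP's actual SPARC code, which is written in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Upper 64 bits of the 128-bit product a * b, built from 32-bit halves.
   This is the value a single umulxhi instruction produces on T3/T4/T5.
   (Hypothetical helper name, portable emulation.) */
uint64_t umul_hi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Contributions to bits 32..63; the sum of three 32-bit values
       cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* 64-bit add with carry-in and carry-out, emulated with comparisons.
   (Hypothetical helper name.) */
uint64_t add_with_carry(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t s = a + b;
    unsigned c1 = s < a;          /* carry out of a + b */
    uint64_t r = s + *carry;
    unsigned c2 = r < s;          /* carry out of adding the carry-in */
    *carry = c1 | c2;             /* at most one of c1, c2 can be set */
    return r;
}

/* rp[0..n-1] = up[0..n-1] * v, returning the final carry limb.
   Limbs are stored least significant first. (Hypothetical sketch.) */
uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = up[i] * v;           /* low half of the product */
        uint64_t hi = umul_hi(up[i], v);   /* high half of the product */
        lo += cy;
        hi += lo < cy;                     /* hi <= 2^64 - 2, so no overflow */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}
```

Each limb needs both halves of a 64 × 64 product plus carry propagation, so a chip without a high-half multiply has to synthesise it from narrower multiplies, much as umul_hi does here.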

The T4 and T5 have zany instructions for directly performing bignum multiplies of up to 2048 bits, optionally with a 2-adic modulo. The design of these instructions is quite unsuitable for GMP, due to their register-based interface as well as more than 100 cycles of loadup, startup, and write-back overhead. Furthermore, the operands need to have the same number of limbs. The register-based interface requires special code for every limb count.
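For contrast, a memory-based multiply routine handles any pair of limb counts with a single loop nest. The following is a minimal schoolbook sketch under stated assumptions: the name mul_basecase_sketch is hypothetical, limbs are 32 bits wide purely so that the code stays within standard C, and limbs are stored least significant first. It is not GMP's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   The operand lengths un and vn may differ; limbs are 32 bits here
   (an assumption made for portability), least significant first. */
void mul_basecase_sketch(uint32_t *rp, const uint32_t *up, size_t un,
                         const uint32_t *vp, size_t vn)
{
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;

    for (size_t j = 0; j < vn; j++) {
        uint64_t cy = 0;
        for (size_t i = 0; i < un; i++) {
            /* up[i]*vp[j] + rp[i+j] + cy fits in 64 bits:
               (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)up[i] * vp[j] + rp[i + j] + cy;
            rp[i + j] = (uint32_t)t;
            cy = t >> 32;
        }
        rp[j + un] = (uint32_t)cy;   /* top limb of this row */
    }
}
```

The loop bounds are run-time parameters, so the same code serves 3 × 1 limbs as well as 32 × 32, which is what a register-based interface with fixed operand layouts cannot offer without per-size code.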

T4-T5 projects

DONE:

Last modified: 2016-12-17