SPARC core pipeline overview
            | US1 US2        | US3 US4        | T1-T2    | T3       | T4-T5
issue width | 4 (2I,1LS,2FP) | 4 (2I,1LS,2FP) | 1        | 1        | 2
issue order | in order       | in order       | in order | in order | out-of-order
FP = floating point, LS = load/store, I = integer op
SPARC optimisation motivation
SPARC chips before T4 under-perform on GMP. This is not because the GMP code
is inadequately optimised for SPARC, but because of the basic v9 ISA as well
as the micro-architecture of these chips. The T1 and T2 chips perform worse
than any other SPARC chips; they are comparable to a 486, a chip some 15
years older.
The T4/T5 are completely different, and are not at all bad GMP performers;
they are now not much slower than a contemporary PC (using GMP repo code for
SPARC). These CPUs are only 2-issue and can perform just one 64-bit
load/store per cycle, but they are out-of-order and have a fully pipelined
integer multiply unit, albeit with an extreme latency of 12 cycles. Unlike
older SPARCs, they (and T3) have an instruction umulxhi for producing the
upper half of a 64 × 64 multiply, and a 64-bit add-with-carry (but no
corresponding subtract).
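To make the pairing concrete, here is a minimal C sketch (GCC extended asm,
assuming a sparc64 target with VIS3, i.e. T3 or later) of how mulx and
umulxhi combine into a full 64 × 64 → 128 bit multiply; the helper name is
made up for this note, and since the two instructions are independent they
can overlap in the pipelined multiply unit:

    /* Minimal sketch: 64x64 -> 128 bit multiply from mulx (low half) and
       umulxhi (high half, VIS3, T3 and later).  The helper name is
       illustrative only.  */
    #include <stdint.h>

    static inline void
    umul128_sketch (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
    {
      uint64_t h, l;
      __asm__ ("mulx    %2, %3, %1\n\t"   /* l = low 64 bits of u*v   */
               "umulxhi %2, %3, %0"       /* h = high 64 bits of u*v  */
               : "=&r" (h), "=&r" (l)
               : "r" (u), "r" (v));
      *hi = h;
      *lo = l;
    }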
The T4 and T5 have zany instructions for directly performing bignum multiplies
of up to 2048 bits, optionally with a 2-adic modulo. The design of these
instructions is quite unsuitable for GMP, due to their register-based interface
as well as > 100 cycles of load-up, start-up, and write-back overhead.
Furthermore, the operands must have the same number of limbs, and the
register-based interface requires special code for every limb count.
T4-T5 projects
- Several mul primitives run one or two cycles slower per iteration than
  anticipated. This does not seem to be directly related to latency
  scheduling.
- Explore using the mpmul instruction for mpn_mul_basecase; a C sketch of
  the decomposition appears after this list. Since mpmul handles just
  same-size operands, a GMP {up,un} × {vp,vn} multiply (where un ≥ vn) will
  require an initial {up,vn} × {vp,vn} multiply, leaving {up+vn,un-vn} ×
  {vp,vn} still to do. If un-vn ≥ vn, then we should probably loop over
  mpmul, adding its low vn limbs to the upper part of the previous product
  and just storing the high vn limbs. Considering that vn will never be >
  MUL_TOOM22_THRESHOLD, i.e., rather small, it is unlikely that what then
  remains should be done with mpmul; better to finish with a loop over
  mpn_addmul_2 (or whatever largest mpn_addmul_k we might have).
  Optimisation: If un = vn + k, and k is small, pad the vp operand with some
  zero limbs when loading it into registers, and truncate the result.
- Consider using the mpmul instruction for mpn_sqr_basecase, for
  large-enough operands. Unlike for mul_basecase, this will need just a
  simple cutoff point to a discrete sqr_basecase loop.
- Implement dual-limb inverse "pi2" Euclidean and Hensel division
  primitives; see the pi2 sketch after this list. This will double
  small-divisor division performance, since the mulx/umulxhi instruction
  latency causes poor performance for the single-limb inverse "pi1" code.
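A rough C sketch of the mul_basecase decomposition described above, under
stated assumptions: mpmul_nxn() is a hypothetical wrapper for the mpmul
instruction (the real thing would be asm that loads the register bank,
issues mpmul, and stores the 2n-limb product), with a portable stand-in
calling mpn_mul_n so the sketch runs anywhere; the tail loop uses
mpn_addmul_1 where the real code would use the widest mpn_addmul_k
available; the zero-padding trick for small un-vn is not shown.

    /* Rough sketch of mpn_mul_basecase built around a hypothetical mpmul
       wrapper.  rp must have room for un+vn limbs, un >= vn >= 1, and vn
       must not exceed the mpmul limit (2048 bits, i.e. 32 64-bit limbs).  */
    #include <gmp.h>

    #define MPMUL_MAX_N 32      /* 2048-bit mpmul limit, in 64-bit limbs */

    /* Hypothetical wrapper for the T4/T5 mpmul instruction; this portable
       stand-in just calls mpn_mul_n so the sketch is runnable anywhere.  */
    static void
    mpmul_nxn (mp_ptr rp, mp_srcptr ap, mp_srcptr bp, mp_size_t n)
    {
      mpn_mul_n (rp, ap, bp, n);        /* {rp,2n} = {ap,n} * {bp,n} */
    }

    static void
    mul_basecase_mpmul_sketch (mp_ptr rp, mp_srcptr up, mp_size_t un,
                               mp_srcptr vp, mp_size_t vn)
    {
      mp_limb_t tp[2 * MPMUL_MAX_N];    /* scratch for one mpmul product */
      mp_size_t done;

      /* Initial vn x vn block: {rp,2vn} = {up,vn} * {vp,vn}.  */
      mpmul_nxn (rp, up, vp, vn);

      /* While at least vn limbs of up remain, multiply the next vn-limb
         chunk, add the product's low half into the previous high part, and
         store its high half while absorbing the carry.  */
      for (done = vn; un - done >= vn; done += vn)
        {
          mp_limb_t cy;
          mpmul_nxn (tp, up + done, vp, vn);
          cy = mpn_add_n (rp + done, rp + done, tp, vn);
          /* The final carry out is always 0: the accumulated value is a
             partial product of the full multiply and cannot overflow.  */
          mpn_add_1 (rp + done + vn, tp + vn, vn, cy);
        }

      /* Fewer than vn limbs of up are left; finish with plain addmul.  */
      for (; done < un; done++)
        rp[done + vn] = mpn_addmul_1 (rp + done, vp, vn, up[done]);
    }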
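As for the pi2 item, here is a small illustration of what the dual-limb
inverse would be, assuming it is the natural extension of the single-limb
"pi1" inverse floor((B^2-1)/d) - B to floor((B^3-1)/d) - B^2. The sketch
pins that definition down by brute force with mpn_tdiv_qr; a real primitive
would compute the inverse cheaply and then use it to consume two dividend
limbs per iteration, amortising the long multiply latency over twice the
work.

    /* "pi2" inverse of a normalized single-limb divisor d, taken here as
           {v1,v0} = floor ((B^3 - 1) / d) - B^2,   B = 2^GMP_NUMB_BITS,
       computed by brute force only to make the definition concrete.  The
       convention the eventual SPARC code would use is an assumption.  */
    #include <assert.h>
    #include <gmp.h>

    static void
    invert_pi2_sketch (mp_limb_t *v1, mp_limb_t *v0, mp_limb_t d)
    {
      mp_limb_t np[3] = { GMP_NUMB_MAX, GMP_NUMB_MAX, GMP_NUMB_MAX };
      mp_limb_t qp[3], rem;

      assert (d >> (GMP_NUMB_BITS - 1));        /* d must be normalized */
      mpn_tdiv_qr (qp, &rem, 0, np, 3, &d, 1);
      /* With B/2 <= d < B the quotient lies in [B^2, 2*B^2), so qp[2] == 1
         and subtracting B^2 leaves just the two low limbs.  */
      *v1 = qp[1];
      *v0 = qp[0];
    }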
DONE:
- We have two new mpn_addmul_1 variants, one 2-way unrolled and one 4-way
  unrolled. Commit the one that gives the best performance for the critical
  operand sizes.
- Rewrite mpn_submul_1 for a speedup from 5.8 c/l to 4.5 c/l.
- Finish the cnd_aors_n.asm code.
- Write a generic file aorsorrlshC_n.asm for addlsh1_n, sublsh1_n,
  rsblsh1_n, addlsh2_n, etc. Performance goal: 4 c/l for add/sub, 4.5 c/l
  for rsb. To reach 4 c/l for sub, one needs to merge the shifted limbs
  with xnor, for a free complement; see the C model after this list.
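Plain C model (assuming 64-bit limbs and no nails) of the xnor trick for
the sub forms: the two halves of each shifted limb are bit-disjoint, so
merging them with xnor instead of or gives the one's complement for free,
turning the subtraction into a pure add-with-carry chain. This only models
the identity; the real code is asm, and the return-value convention below
is my reading rather than something taken from the sources.

    /* C model of rp[] = up[] - 2*vp[] (sublsh1_n) using the xnor merge:
       hi = vp[i]<<1 and lo = vp[i-1]>>63 share no bits, so
       ~(hi | lo) == ~(hi ^ lo), i.e. one xnor both merges and complements.
       Adding the complement plus an initial carry of 1 implements the
       subtraction.  */
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t
    sublsh1_n_model (uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                     size_t n)
    {
      uint64_t prev = 0;                /* vp[i-1], for the shifted-in bit */
      uint64_t cy = 1;                  /* the +1 of the two's complement  */
      size_t i;

      for (i = 0; i < n; i++)
        {
          uint64_t t = ~((vp[i] << 1) ^ (prev >> 63));  /* xnor merge */
          uint64_t s = up[i] + t;
          uint64_t c1 = s < up[i];
          uint64_t r = s + cy;
          uint64_t c2 = r < s;
          rp[i] = r;
          cy = c1 | c2;                 /* at most one of c1, c2 is set */
          prev = vp[i];
        }
      return (1 - cy) + (prev >> 63);   /* borrow, in 0..2 */
    }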