SPARC core pipeline overview
            | US1 US2        | US3 US4        | T1-T2    | T3       | T4-T5
issue width | 4 (2I,1LS,2FP) | 4 (2I,1LS,2FP) | 1        | 1        | 2
issue order | in order       | in order       | in order | in order | out-of-order
FP = floating point, LS = load/store, I = integer op
SPARC optimisation motivation
SPARC chips before T4 under-perform on GMP. This is not because the GMP code
is inadequately optimised for SPARC, but because of the basic v9 ISA as well
as the micro-architecture of these chips. The T1 and T2 chips perform worse
than any other SPARC chips; they are comparable to a 486, a chip some 15
years older.
The T4/T5 are completely different, and are not at all bad GMP performers;
they are now not much slower than a contemporary PC (using GMP repo code for
SPARC). These CPUs are only 2-issue and can perform just one 64-bit
load/store per cycle, but they are out-of-order and have a fully pipelined
integer multiply unit, albeit with an extreme latency of 12 cycles. Unlike
older SPARCs, they (and T3) have an instruction umulxhi for producing the
upper half of a 64 × 64 multiply, and a 64-bit add-with-carry (but no
corresponding subtract).
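To make the pairing concrete, here is a minimal C sketch (GCC extended asm,
assuming a sparc64 target with VIS3, i.e. T3 or later) of how mulx and
umulxhi combine into a full 64 × 64 → 128 bit multiply; the helper name is
made up for this note, and since the two instructions are independent they
can overlap in the pipelined multiply unit:

    /* Minimal sketch: 64x64 -> 128 bit multiply from mulx (low half) and
       umulxhi (high half, VIS3, T3 and later).  The helper name is
       illustrative only.  */
    #include <stdint.h>

    static inline void
    umul128_sketch (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
    {
      uint64_t h, l;
      __asm__ ("mulx    %2, %3, %1\n\t"   /* l = low 64 bits of u*v   */
               "umulxhi %2, %3, %0"       /* h = high 64 bits of u*v  */
               : "=&r" (h), "=&r" (l)
               : "r" (u), "r" (v));
      *hi = h;
      *lo = l;
    }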
The T4 and T5 have zany instructions for directly performing bignum multiplies
of up to 2048 bits, optionally with a 2-adic modulo. The design of these
instructions is quite unsuitable for GMP, due to their register-based interface
as well as > 100 cycles of load-up, start-up, and write-back overhead.
Furthermore, the operands must have the same number of limbs, and the
register-based interface requires special code for every limb count.
T4-T5 projects
- Several mul primitives run one or two cycles slower per iteration than
  anticipated. This does not seem to be directly related to latency
  scheduling.
- Explore using the mpmul instruction for mpn_mul_basecase; a C sketch of
  the decomposition appears after this list. Since mpmul handles just
  same-size operands, a GMP {up,un} × {vp,vn} multiply (where un ≥ vn) will
  require an initial {up,vn} × {vp,vn} multiply, leaving {up+vn,un-vn} ×
  {vp,vn} still to do. If un-vn ≥ vn, then we should probably loop over
  mpmul, adding its low vn limbs to the upper part of the previous product
  and just storing the high vn limbs. Considering that vn will never be >
  MUL_TOOM22_THRESHOLD, i.e., rather small, it is unlikely that what then
  remains should be done with mpmul; better to finish with a loop over
  mpn_addmul_2 (or whatever largest mpn_addmul_k we might have).
  Optimisation: If un = vn + k, and k is small, pad the vp operand with some
  zero limbs when loading it into registers, and truncate the result.
- Consider using the mpmul instruction for mpn_sqr_basecase, for
  large-enough operands. Unlike for mul_basecase, this will need just a
  simple cutoff point to a discrete sqr_basecase loop.
- Implement dual-limb inverse "pi2" Euclidean and Hensel division
  primitives; see the pi2 sketch after this list. This will double
  small-divisor division performance, since the mulx/umulxhi instruction
  latency causes poor performance for the single-limb inverse "pi1" code.
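A rough C sketch of the mul_basecase decomposition described above, under
stated assumptions: mpmul_nxn() is a hypothetical wrapper for the mpmul
instruction (the real thing would be asm that loads the register bank,
issues mpmul, and stores the 2n-limb product), with a portable stand-in
calling mpn_mul_n so the sketch runs anywhere; the tail loop uses
mpn_addmul_1 where the real code would use the widest mpn_addmul_k
available; the zero-padding trick for small un-vn is not shown.

    /* Rough sketch of mpn_mul_basecase built around a hypothetical mpmul
       wrapper.  rp must have room for un+vn limbs, un >= vn >= 1, and vn
       must not exceed the mpmul limit (2048 bits, i.e. 32 64-bit limbs).  */
    #include <gmp.h>

    #define MPMUL_MAX_N 32      /* 2048-bit mpmul limit, in 64-bit limbs */

    /* Hypothetical wrapper for the T4/T5 mpmul instruction; this portable
       stand-in just calls mpn_mul_n so the sketch is runnable anywhere.  */
    static void
    mpmul_nxn (mp_ptr rp, mp_srcptr ap, mp_srcptr bp, mp_size_t n)
    {
      mpn_mul_n (rp, ap, bp, n);        /* {rp,2n} = {ap,n} * {bp,n} */
    }

    static void
    mul_basecase_mpmul_sketch (mp_ptr rp, mp_srcptr up, mp_size_t un,
                               mp_srcptr vp, mp_size_t vn)
    {
      mp_limb_t tp[2 * MPMUL_MAX_N];    /* scratch for one mpmul product */
      mp_size_t done;

      /* Initial vn x vn block: {rp,2vn} = {up,vn} * {vp,vn}.  */
      mpmul_nxn (rp, up, vp, vn);

      /* While at least vn limbs of up remain, multiply the next vn-limb
         chunk, add the product's low half into the previous high part, and
         store its high half while absorbing the carry.  */
      for (done = vn; un - done >= vn; done += vn)
        {
          mp_limb_t cy;
          mpmul_nxn (tp, up + done, vp, vn);
          cy = mpn_add_n (rp + done, rp + done, tp, vn);
          /* The final carry out is always 0: the accumulated value is a
             partial product of the full multiply and cannot overflow.  */
          mpn_add_1 (rp + done + vn, tp + vn, vn, cy);
        }

      /* Fewer than vn limbs of up are left; finish with plain addmul.  */
      for (; done < un; done++)
        rp[done + vn] = mpn_addmul_1 (rp + done, vp, vn, up[done]);
    }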
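As for the pi2 item, here is a small illustration of what the dual-limb
inverse would be, assuming it is the natural extension of the single-limb
"pi1" inverse floor((B^2-1)/d) - B to floor((B^3-1)/d) - B^2. The sketch
pins that definition down by brute force with mpn_tdiv_qr; a real primitive
would compute the inverse cheaply and then use it to consume two dividend
limbs per iteration, amortising the long multiply latency over twice the
work.

    /* "pi2" inverse of a normalized single-limb divisor d, taken here as
           {v1,v0} = floor ((B^3 - 1) / d) - B^2,   B = 2^GMP_NUMB_BITS,
       computed by brute force only to make the definition concrete.  The
       convention the eventual SPARC code would use is an assumption.  */
    #include <assert.h>
    #include <gmp.h>

    static void
    invert_pi2_sketch (mp_limb_t *v1, mp_limb_t *v0, mp_limb_t d)
    {
      mp_limb_t np[3] = { GMP_NUMB_MAX, GMP_NUMB_MAX, GMP_NUMB_MAX };
      mp_limb_t qp[3], rem;

      assert (d >> (GMP_NUMB_BITS - 1));        /* d must be normalized */
      mpn_tdiv_qr (qp, &rem, 0, np, 3, &d, 1);
      /* With B/2 <= d < B the quotient lies in [B^2, 2*B^2), so qp[2] == 1
         and subtracting B^2 leaves just the two low limbs.  */
      *v1 = qp[1];
      *v0 = qp[0];
    }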
DONE:
- We have two new mpn_addmul_1 variants, one 2-way unrolled and one 4-way
  unrolled. Commit the one that gives the best performance for the critical
  operand sizes.
- Rewrite mpn_submul_1 for a speedup from 5.8 c/l to 4.5 c/l.
- Finish the cnd_aors_n.asm code.
- Write a generic file aorsorrlshC_n.asm for addlsh1_n, sublsh1_n,
  rsblsh1_n, addlsh2_n, etc. Performance goal: 4 c/l for add/sub, 4.5 c/l
  for rsb. To reach 4 c/l for sub, one needs to merge the shifted limbs
  with xnor, for a free complement; see the C model after this list.
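Plain C model (assuming 64-bit limbs and no nails) of the xnor trick for
the sub forms: the two halves of each shifted limb are bit-disjoint, so
merging them with xnor instead of or gives the one's complement for free,
turning the subtraction into a pure add-with-carry chain. This only models
the identity; the real code is asm, and the return-value convention below
is my reading rather than something taken from the sources.

    /* C model of rp[] = up[] - 2*vp[] (sublsh1_n) using the xnor merge:
       hi = vp[i]<<1 and lo = vp[i-1]>>63 share no bits, so
       ~(hi | lo) == ~(hi ^ lo), i.e. one xnor both merges and complements.
       Adding the complement plus an initial carry of 1 implements the
       subtraction.  */
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t
    sublsh1_n_model (uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                     size_t n)
    {
      uint64_t prev = 0;                /* vp[i-1], for the shifted-in bit */
      uint64_t cy = 1;                  /* the +1 of the two's complement  */
      size_t i;

      for (i = 0; i < n; i++)
        {
          uint64_t t = ~((vp[i] << 1) ^ (prev >> 63));  /* xnor merge */
          uint64_t s = up[i] + t;
          uint64_t c1 = s < up[i];
          uint64_t r = s + cy;
          uint64_t c2 = r < s;
          rp[i] = r;
          cy = c1 | c2;                 /* at most one of c1, c2 is set */
          prev = vp[i];
        }
      return (1 - cy) + (prev >> 63);   /* borrow, in 0..2 */
    }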