[PATCH] Optimize 32-bit sparc T1 multiply routines.
Torbjorn Granlund
tg at gmplib.org
Sun Jan 6 13:48:42 CET 2013
David Miller <davem at davemloft.net> writes:
  From: Torbjorn Granlund <tg at gmplib.org>
  Date: Fri, 04 Jan 2013 14:54:15 +0100

  > 1:	umulxhi	%g5, %g5, %g1
  > 	mulx	%g5, %g5, %g2
  > 	umulxhi	%g5, %g5, %g3
  > 	mulx	%g5, %g5, %g4
  > 	umulxhi	%g5, %g5, %i1
  > 	mulx	%g5, %g5, %i2
  > 	umulxhi	%g5, %g5, %i3
  > 	mulx	%g5, %g5, %i4
  > 	brnz,a	%i0, 1b
  > 	 addx	%i0, -1, %i0
  >
  > 	ret

  This runs in 9 seconds with a suitably initialized %i0
  for this processor (2.8 GHz).
Not bad. It seems safe to assume the pipeline can sustain one integer
mul per cycle.
I recommend 4-way unrolling.
The summation method of mpn/powerpc64/mode64/aorsmul_1.asm might be
best. Unfortunately, its submul_1 method will not work. I cannot see
how we can avoid an extra instruction for every limb.
This is the PPC code:
L(top):	mulld	r0, r9, r6	C lo(first product)
	mulhdu	r5, r9, r6	C hi(first product)
	mulld	r7, r27, r6	C lo(second product)
	mulhdu	r8, r27, r6	C hi(second product)
	ld	r9, 0(up)	C load ahead: next up and rp limbs
	ld	r28, 0(rp)
	ld	r27, 8(up)
	ld	r29, 8(rp)
	adde	r0, r0, r12	C fold in carry limb from previous iteration
	adde	r7, r7, r5	C chain hi halves into the next limb
	mulld	r5, r9, r6	C lo(third product)
	mulhdu	r10, r9, r6
	mulld	r11, r27, r6	C lo(fourth product)
	mulhdu	r12, r27, r6
	ld	r9, 16(up)
	ld	r30, 16(rp)
	ld	r27, 24(up)
	ld	r31, 24(rp)
	adde	r5, r5, r8
	adde	r11, r11, r10
	addze	r12, r12	C carry limb for next iteration
	ADDSUB	r0, r0, r28	C add/subtract the rp limbs and store
	std	r0, 0(rp)
	ADDSUBC	r7, r7, r29
	std	r7, 8(rp)
	ADDSUBC	r5, r5, r30
	std	r5, 16(rp)
	ADDSUBC	r11, r11, r31
	std	r11, 24(rp)
	addi	up, up, 32	C advance pointers by 4 limbs
	addi	rp, rp, 32
	bdnz	L(top)
I suspect one should schedule the ld insns sooner, moving them up into
the mul block.  That might require a few more temporary registers,
which costs nothing on SPARC.
	PPC		SPARC
	mulld		mulx
	mulhdu		umulxhi
	adde		addxccc
	addze		addxccc with %g0
	ADDSUB		addxc
	ADDSUBC		addxccc

(ADDSUBC is an alias for adde in that code when doing addmul_1.)
Translated this way, the loop will be 33 insns per 4-limb iteration,
which could take 4.25 cycles/limb (17 cycles per iteration) if the
pipeline sustains close to two insns per cycle.  Without addmul_2,
addmul_1 will be the most critical function, and should get lots of
attention.
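To make the 33-insn count concrete, here is roughly how the loop might
transcribe under that mapping.  This is an untested sketch with an
arbitrary register assignment: up, rp, n, v0, u0, u1 stand for
m4-defined registers, n is assumed biased so the final brgz exits at
the right point, and preamble and wind-down are omitted.  I use
addxccc even in the ADDSUB slot, since the addze leaves the carry flag
clear at that point:

L(top):	mulx	u0, v0, %g1	C lo(first product)
	umulxhi	u0, v0, %g2	C hi(first product)
	mulx	u1, v0, %g3
	umulxhi	u1, v0, %g4
	ldx	[up+0], u0	C load ahead, as in the PPC code
	ldx	[rp+0], %l0
	ldx	[up+8], u1
	ldx	[rp+8], %l1
	addxccc	%g1, %o4, %g1	C %o4 plays r12's role (carry limb)
	addxccc	%g3, %g2, %g3
	mulx	u0, v0, %o0
	umulxhi	u0, v0, %o1
	mulx	u1, v0, %o2
	umulxhi	u1, v0, %o4
	ldx	[up+16], u0
	ldx	[rp+16], %l2
	ldx	[up+24], u1
	ldx	[rp+24], %l3
	addxccc	%o0, %g4, %o0
	addxccc	%o2, %o1, %o2
	addxccc	%o4, %g0, %o4	C addze; leaves carry flag = 0
	addxccc	%g1, %l0, %g1	C add the rp limbs back and store
	stx	%g1, [rp+0]
	addxccc	%g3, %l1, %g3
	stx	%g3, [rp+8]
	addxccc	%o0, %l2, %o0
	stx	%o0, [rp+16]
	addxccc	%o2, %l3, %o2
	stx	%o2, [rp+24]
	add	up, 32, up	C plain add preserves the carry flag
	add	rp, 32, rp
	brgz,pt	n, L(top)
	 add	n, -4, n	C decrement in the delay slot

The point of the uniform addxccc chain is that the carry flag stays
live across the loop edge: the carry-out of the last rp add is exactly
what the first addxccc of the next iteration must consume, and neither
the pointer adds, the loads and stores, nor the branch disturb xcc.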
With addmul_2, one saves loads and stores. With addmul_(k) for large k
we will approach just two mul insns and two addxccc per cross multiply,
which I assume would take 2 cycles. An addmul_2 should take around
3+epsilon cycles for this CPU, an addmul_3 should take 2.67+epsilon and
an addmul_4 should take 2.5+epsilon.
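One way to arrive at these numbers: per up limb, addmul_k performs k
cross products using 2k mul-type insns and 2k addxccc, plus one ldx of
the up limb, an ldx/stx pair for rp, and roughly one insn of amortised
loop overhead, i.e. about 4k + 4 insns.  At two insns per cycle that is

  (4k + 4) / 2 / k = 2 + 2/k cycles per cross multiply,

i.e. 3, 2.67, and 2.5 for k = 2, 3, 4, approaching 2 as k grows.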
An addmul_4 is usually smaller than addmul_1, since it typically
requires no unrolling.
This is a complex optimisation project, in particular with the odd mpmul
insn.  I suspect addmul_(k) for k up to 4 would be hard to beat in
mul_basecase, except when m = n.  But for large enough m (assuming m >= n
as mul_basecase does) one could use mpmul(n) (repeated if m >= 2n), then
mpmul(m mod n) (again possibly repeated), until we reach a threshold.
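To spell out my reading of that with concrete numbers: for m = 40 and
n = 16, that would mean two mpmul(16) blocks covering the first 32
limbs of the large operand, then the remaining 8x16 strip done as two
8x8 mpmul(8) products, each partial product summed into the result.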
--
Torbjörn