[PATCH] Optimize 32-bit sparc T1 multiply routines.
David Miller
davem at davemloft.net
Fri Jan 18 21:02:29 CET 2013
From: Torbjorn Granlund <tg at gmplib.org>
Date: Sun, 06 Jan 2013 13:48:42 +0100
> I recommend 4-way unrolling.
>
> The summation method of mpn/powerpc64/mode64/aorsmul_1.asm might be
> best.
Thanks for all of these pointers and suggestions.
While waiting for the FSF to execute my assignment, I tweaked my
existing 2-way unrolled mul_1 and addmul_1 loops. Currently on T4 I'm
at:
mul_1:		3.8 cycles/limb
L(top):
	ldx	[up+0], %g1	! load up[0]
	sub	n, 2, n		! two limbs per iteration
	ldx	[up+8], %o4	! load up[1]
	mulx	%g1, v0, %g3	! low half of up[0] * v0
	add	up, 16, up	! advance source pointer
	umulxhi	%g1, v0, %g2	! high half of up[0] * v0
	mulx	%o4, v0, %g1	! low half of up[1] * v0
	add	rp, 16, rp	! advance destination pointer
	addxccc	%g3, %o5, %g3	! fold in carry limb from previous iteration
	umulxhi	%o4, v0, %o5	! high half of up[1] * v0, next carry limb
	stx	%g3, [rp-16]	! store result limb 0
	addxccc	%g1, %g2, %g1	! add high half of up[0]*v0, propagate carry
	brgz	n, L(top)	! loop while limbs remain
	stx	%g1, [rp-8]	! store result limb 1 (delay slot)
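
For reference, here is a minimal C sketch of what this loop computes
(rp[] = up[] * v0, returning the final carry limb).  The name mul_1_ref
and the use of unsigned __int128 are my own illustration, not GMP code;
the mulx/umulxhi pair above produces the low/high halves of the 128-bit
product, and the addxccc chain plays the role of the carry variable:

#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

mp_limb_t
mul_1_ref (mp_limb_t *rp, const mp_limb_t *up, size_t n, mp_limb_t v0)
{
  mp_limb_t carry = 0;				/* %o5 in the asm */
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v0;
      mp_limb_t lo = (mp_limb_t) p;		/* mulx */
      mp_limb_t hi = (mp_limb_t) (p >> 64);	/* umulxhi */
      lo += carry;				/* addxccc ... */
      carry = hi + (lo < carry);		/* ... and its carry out */
      rp[i] = lo;				/* stx */
    }
  return carry;
}
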
addmul_1:	5.5 cycles/limb
L(top):
	ldx	[up+0], %l0	! load up[0]
	ldx	[up+8], %l1	! load up[1]
	ldx	[rp+0], %l2	! load rp[0]
	ldx	[rp+8], %l3	! load rp[1]
	mulx	%l0, v0, %o0	! low half of up[0] * v0
	add	up, 16, up	! advance source pointer
	umulxhi	%l0, v0, %o1	! high half of up[0] * v0
	add	rp, 16, rp	! advance destination pointer
	mulx	%l1, v0, %o2	! low half of up[1] * v0
	sub	n, 2, n		! two limbs per iteration
	umulxhi	%l1, v0, %o3	! high half of up[1] * v0
	addxccc	%o0, %o5, %o0	! fold in carry limb from previous iteration
	addxccc	%o2, %o1, %o2	! add high half of up[0]*v0, propagate carry
	addxc	%g0, %o3, %o5	! next carry limb = high of up[1]*v0 + carry
	addcc	%o0, %l2, %o0	! add rp[0], starting the second carry chain
	stx	%o0, [rp-16]	! store result limb 0
	addxccc	%o2, %l3, %o2	! add rp[1] with carry
	brgz	n, L(top)	! loop while limbs remain
	stx	%o2, [rp-8]	! store result limb 1 (delay slot)
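
Likewise, an illustrative C version of the addmul_1 semantics, reusing
the definitions from the sketch above (rp[] += up[] * v0, returning the
final carry).  Again, addmul_1_ref and __int128 are assumptions for the
sketch; note the two carry sources per limb, which is what the separate
addxccc and addcc/addxccc chains in the asm keep apart:

mp_limb_t
addmul_1_ref (mp_limb_t *rp, const mp_limb_t *up, size_t n, mp_limb_t v0)
{
  mp_limb_t carry = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v0;
      mp_limb_t lo = (mp_limb_t) p;
      mp_limb_t hi = (mp_limb_t) (p >> 64);
      lo += carry;
      carry = hi + (lo < carry);	/* carry from the multiply chain */
      mp_limb_t r = rp[i] + lo;
      carry += (r < lo);		/* carry from adding into rp[] */
      rp[i] = r;
    }
  return carry;
}
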
Just FYI...