[PATCH] Optimize 32-bit sparc T1 multiply routines.
Torbjorn Granlund
tg at gmplib.org
Sun Jan 6 13:08:25 CET 2013
David Miller <davem at davemloft.net> writes:
> Thanks for your help, the following works.  I'll work on unrolling
> and scheduling it.
>
> PROLOGUE(mpn_sub_nc)
> 	ba,pt	%xcc, L(ent)
> 	xor	cy, 1, cy
> EPILOGUE()
> PROLOGUE(mpn_sub_n)
> 	mov	1, cy
> L(ent):	cmp	%g0, cy
> L(top):	ldx	[up+0], %o4
> 	add	up, 8, up
> 	ldx	[vp+0], %o5
> 	add	vp, 8, vp
> 	add	rp, 8, rp
> 	add	n, -1, n
> 	xnor	%o5, %g0, %o5
> 	addxccc	%o4, %o5, %g3
> 	brgz	n, L(top)
> 	stx	%g3, [rp-8]
> 	clr	%o0
> 	retl
> 	movcc	%xcc, 1, %o0
> EPILOGUE()
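For readers less familiar with the trick in the quoted loop: the subtraction is done as u + ~v + cy, with the carry seeded to 1 (the xnor complements v, addxccc adds through the carry chain, and the final movcc returns the borrow).  A hedged C sketch of that logic, not the GMP API, might look like:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* Illustrative C model of the quoted mpn_sub_n loop: subtraction is
   computed as u + ~v + cy with cy seeded to 1, mirroring the
   xnor / addxccc carry chain.  Returns the final borrow (0 or 1),
   as the trailing movcc does. */
mp_limb_t sub_n_sketch(mp_limb_t *rp, const mp_limb_t *up,
                       const mp_limb_t *vp, size_t n)
{
    mp_limb_t cy = 1;                 /* no incoming borrow */
    for (size_t i = 0; i < n; i++) {
        mp_limb_t u = up[i], v = ~vp[i];
        mp_limb_t s = u + v;
        mp_limb_t c = s < u;          /* carry out of u + ~v */
        s += cy;
        c += s < cy;                  /* carry out of adding cy */
        rp[i] = s;
        cy = c;
    }
    return cy ^ 1;                    /* borrow = !carry */
}
```

The mpn_sub_nc entry point's "xor cy, 1, cy" fits the same model: an incoming borrow simply flips the seed.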
Since we are working with a throughput-constrained pipeline, we should
really use as few insns as possible.

There are 6 operation insns, and it seems hard to use fewer than 5
bookkeeping insns.  With k-way unrolling we should then get to
max(3, (6k+5)/(2k)) cycles/limb.
For small k, we could instead point each pointer at the end of its
operand, then use a combined index and loop counter running -n...0.
This would give max(3, (7k+1)/(2k)) cycles/limb.
(The max(3...) handles the load/store bandwidth limit. It has no
limiting effect for sub_n, but it does for add_n.)
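The combined index/counter idea (method 2) can be sketched in C: advance the pointers past the end of their operands, then let one register serve as both the limb offset and the loop counter, counting -n, ..., -1.  This is a hedged illustration of the loop structure only; the carry handling below is plain C, not the addxccc carry chain, and the function name is invented:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* "Method 2" loop structure: pointers pre-advanced to the end,
   a single negative index i doubles as offset and loop counter,
   so the loop needs only one bookkeeping update (i++) plus the
   terminating branch. */
mp_limb_t add_n_method2(mp_limb_t *rp, const mp_limb_t *up,
                        const mp_limb_t *vp, size_t n)
{
    mp_limb_t cy = 0;
    rp += n; up += n; vp += n;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {
        mp_limb_t u = up[i], v = vp[i];
        mp_limb_t s = u + v;
        mp_limb_t c = s < u;          /* carry out of u + v */
        s += cy;
        c += s < cy;                  /* carry out of adding cy */
        rp[i] = s;
        cy = c;
    }
    return cy;
}
```

The per-limb cost of the extra-insn tradeoff is what the (7k+1)/(2k) formula above accounts for.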
sub_n:

  k    method 1   method 2
  1      5.5        4.0
  2      4.2        3.8
  3      3.8        3.7
  4      3.6        3.6
  5      3.5        3.6
  6      3.4        3.6
  7      3.4        3.6
  8      3.3        3.6
 oo      3.0        3.5
add_n:

  k    method 1   method 2
  1      5.0        3.5
  2      3.8        3.2
  3      3.3        3.2
  4      3.1        3.1
  5      3.0        3.1
  6      3.0        3.1
  7      3.0        3.1
  8      3.0        3.1
 oo      3.0        3.0
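The tables follow directly from the formulas above.  As a sanity check, here is a small C model; note that only the sub_n operation count (6 insns) is stated in the text, and taking add_n to need 5 (no xnor for the complement) is an inference that reproduces its table:

```c
/* Predicted cycles/limb for k-way unrolling on the T1.
   Method 1: max(3, (ops*k + 5) / (2k)) -- 5 bookkeeping insns.
   Method 2: max(3, ((ops+1)*k + 1) / (2k)) -- one extra insn per
   limb for the combined index, but only 1 bookkeeping insn.
   The floor of 3 is the load/store bandwidth limit: 2 loads and
   1 store per limb on a pipeline issuing at most one memory op
   per cycle... per the max(3, ...) remark in the text. */
static double method1(int ops, int k)
{
    double c = (double)(ops * k + 5) / (2 * k);
    return c > 3.0 ? c : 3.0;
}

static double method2(int ops, int k)
{
    double c = (double)((ops + 1) * k + 1) / (2 * k);
    return c > 3.0 ? c : 3.0;
}
```

For example, method1(6, 1) gives the 5.5 c/l in the sub_n table, and method1(5, 8) bottoms out at the 3.0 load/store limit seen for add_n.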
For add_n, I recommend either method 1 with 4-way unrolling, or method 2
with 2-way unrolling.
For sub_n we should use at least 4-way unrolling.
--
Torbjörn