Possible new T3-T5 mul_1
Torbjorn Granlund
tg at gmplib.org
Tue Apr 2 20:24:21 CEST 2013
David Miller <davem at davemloft.net> writes:
> See attached, looks like mul1b isn't able to reach 3 c/l like mul1a can.
> overhead 6.00 cycles, precision 10000000 units of 3.51e-10 secs, CPU freq 2847.41 MHz
Darn. Is the load latency > 3 cycles?
The old code had a load-use schedule of 8 cycles; the new code uses 3.
Both variants schedule mulx/umulxhi 4 cycles after the instructions they depend on.
Could you please run these two code snippets and time them?
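! Snippet 1: chain of dependent mulx instructions
        .global main
main:   sethi   %hi(2800000000), %g5    ! loop counter
1:      mulx    %g1, %g1, %g1           ! each mulx needs the previous result
        mulx    %g1, %g1, %g1
        mulx    %g1, %g1, %g1
        mulx    %g1, %g1, %g1
        brnz    %g5, 1b
         dec    %g5                     ! delay slot
        retl
         nop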
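! Snippet 2: the same chain with a dependent add after each mulx
        .global main
main:   sethi   %hi(2800000000), %g5    ! loop counter
1:      mulx    %g1, %g1, %g1
        add     %g1, %g1, %g1           ! add consumes the mulx result
        mulx    %g1, %g1, %g1
        add     %g1, %g1, %g1
        mulx    %g1, %g1, %g1
        add     %g1, %g1, %g1
        mulx    %g1, %g1, %g1
        add     %g1, %g1, %g1
        brnz    %g5, 1b
         dec    %g5                     ! delay slot
        retl
         nop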
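
For reading the timings, here is a minimal sketch of how a measured run time of either loop could be converted into cycles per iteration and per dependent step. It is not part of the original message. The function names are hypothetical, the clock frequency is the one from the quoted speed output, and the loop overhead (brnz/dec) is assumed to overlap with the multiply chain.

```python
# Hypothetical helper (not from the original message): turn a measured run
# time of one of the loops above into cycles per iteration and per step.

LOOP_CONSTANT = 2_800_000_000   # operand of %hi() in both snippets
CPU_HZ = 2847.41e6              # "CPU freq" from the quoted speed output


def sethi_hi(value):
    """Value left in the register by `sethi %hi(value), reg`:
    bits 31..10 of value, with the low 10 bits cleared."""
    return value & 0xFFFFFC00


def cycles_per_iteration(seconds, cpu_hz=CPU_HZ, iterations=None):
    """Average cycles spent per trip through the loop body."""
    if iterations is None:
        # brnz tests the counter before the delay-slot dec takes effect,
        # so the body runs once more than the initial counter value.
        iterations = sethi_hi(LOOP_CONSTANT) + 1
    return seconds * cpu_hz / iterations


def cycles_per_step(seconds, steps_per_iteration=4, **kwargs):
    """Cycles per dependent step: one mulx in the first snippet,
    one mulx+add pair in the second (4 steps per iteration in both).
    Assumes the brnz/dec overhead is hidden behind the chain."""
    return cycles_per_iteration(seconds, **kwargs) / steps_per_iteration
```

Since 2800000000 is a multiple of 1024, sethi loads it exactly, and the count is close to the clock frequency, so the run time in seconds is roughly the number of cycles per iteration. Calling cycles_per_step on the time of the first snippet would estimate the mulx latency. The difference between the two snippets' results would estimate the extra cycles each dependent add costs.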
--
Torbjörn