mul_1 2-way on T3
tg at gmplib.org
Wed Apr 3 13:56:41 CEST 2013
David Miller <davem at davemloft.net> writes:
I'm convinced now that mul_1a.asm is the only loop which will execute
The a variant schedules loads 8 to 10 cycles from its mul dependees.
The b variant schedules loads 3 to 6 cycles from its mul dependees.
The longer scheduling numbers happen when the latency crosses the loop
bookkeeping insn block.
I would assume that a compromise between a and b, one which schedules
loads 6 to 8 cycles from their mul dependees, would also run at 3 c/l.
The L1d latency is "just" 5 cycles.
Now, I would claim that this is not "optimal": by using 4 ptrs for up
and 4 ptrs for rp, as per your previous suggestion, we could cut a cycle
per iteration, giving 2.75 c/l for 4-way unrolling, but at a setup cost.
Alternatively, we could 256-way unroll this without executing from L2...
That'd run at 2.50391 c/l.
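
The 2.75 c/l and 2.50391 c/l figures are both consistent with a simple
cost model: roughly 2.5 cycles of work per limb plus 1 cycle of loop
overhead per iteration. That model is an inference from the two numbers,
not something stated in this thread, and the function and constant names
below are hypothetical. A minimal Python sketch of the arithmetic:

```python
from fractions import Fraction

# ASSUMPTION (inferred from the two quoted figures, not stated in the thread):
# each limb costs 2.5 cycles, and each loop iteration adds 1 cycle of overhead.
CORE_CYCLES_PER_LIMB = Fraction(5, 2)
OVERHEAD_CYCLES_PER_ITERATION = 1


def cycles_per_limb(unroll):
    """Return the modelled c/l for a loop handling `unroll` limbs per iteration."""
    if unroll < 1:
        raise ValueError("unroll factor must be at least 1")
    # Overhead is paid once per iteration, so it is amortised over `unroll` limbs.
    return CORE_CYCLES_PER_LIMB + Fraction(OVERHEAD_CYCLES_PER_ITERATION, unroll)


if __name__ == "__main__":
    for unroll in (4, 256):
        print(f"{unroll:3d}-way: {float(cycles_per_limb(unroll)):.5f} c/l")
```

Under this model, 4-way unrolling costs 11 cycles per iteration (2.75 c/l)
and 256-way costs 641 cycles per iteration (641/256 = 2.50390625 c/l), so
the gain from further unrolling shrinks quickly while code size keeps
growing.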