mul_1 2-way on T3

Wed Apr 3 13:56:41 CEST 2013

David Miller <davem at davemloft.net> writes:

  I'm convinced now that mul_1a.asm is the only loop which will execute
  optimally.

The a variant schedules loads 8 to 10 cycles ahead of their dependent muls.
The b variant schedules loads 3 to 6 cycles ahead of their dependent muls.

The longer distances occur when the load-to-use span crosses the loop
bookkeeping instruction block.

I would assume that a compromise between a and b which schedules loads 6
to 8 cycles ahead would also run at 3 c/l.  The L1d latency is "just" 5
cycles after all.

Now, I would claim that this is not "optimal": using 4 pointers for up
and 4 pointers for rp, as per your previous suggestion, we could cut a
cycle per iteration and reach 2.75 c/l with 4-way unrolling, though at a
setup cost.  Alternatively, we could unroll this 256-way without
executing from L2...  That would run at 2.50391 c/l.
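
The two figures above are consistent with a simple cost model, which is
assumed here rather than measured: 2.5 cycles per limb, plus one cycle
of loop overhead per iteration once the extra pointers are in place.
The sketch below just spells out that arithmetic; the function name and
the two constants are illustrative, not taken from any GMP source.

```python
# Assumed model (inferred from the 2.75 and 2.50391 figures, not measured):
# every limb costs 2.5 cycles, and every loop iteration adds 1 cycle
# of bookkeeping overhead.
CORE_CYCLES_PER_LIMB = 2.5      # assumption
OVERHEAD_PER_ITERATION = 1.0    # assumption


def cycles_per_limb(unroll):
    """Estimate c/l for a loop that handles `unroll` limbs per iteration."""
    if unroll < 1:
        raise ValueError("unroll factor must be at least 1")
    cycles_per_iteration = CORE_CYCLES_PER_LIMB * unroll + OVERHEAD_PER_ITERATION
    return cycles_per_iteration / unroll


if __name__ == "__main__":
    for n in (4, 256):
        print(f"{n:3d}-way: {cycles_per_limb(n):.5f} c/l")
```

Under this model the overhead term shrinks as 1/n, so 4-way gives
11/4 = 2.75 c/l and 256-way gives 641/256 c/l, which rounds to 2.50391.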

-- 
Torbjörn