mul_1 2-way on T3

Wed Apr 3 02:28:40 CEST 2013

From: David Miller <davem at davemloft.net>
Date: Tue, 02 Apr 2013 20:24:51 -0400 (EDT)

> Only loops like mul_1a.asm (and potentially mul_1b.asm) can, because
> only they have enough cycles in the loop to retire multiplies without
> positive accumulation into the OoO buffer.

Actually, mul_1b.asm cannot achieve 6 cycles per loop because of the
3-cycle load-to-use latency there.

This causes an OoO queue-up, which in turn makes the dependent mulx's
sit in the OoO buffer that much longer, and their dependent
instructions likewise, and so on, eventually overflowing the OoO
buffer.
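
To make the queue-up argument concrete, here is a rough toy model, not
a T3 simulation. It assumes the front end issues one loop iteration
every issue_cycles cycles while the dependency chain only lets one
iteration retire every drain_cycles cycles, so the buffer gains a fixed
surplus per iteration until it is full. The function name, the
parameters, and all the numbers in the usage note are made up for
illustration.

```python
def iterations_until_full(issue_cycles, drain_cycles, entries_per_iter, capacity):
    """Toy model: iteration count at which the OoO buffer first fills.

    Returns None when retirement keeps up with issue, i.e. occupancy
    never grows. All parameters are hypothetical, not measured T3 values.
    """
    if drain_cycles <= issue_cycles:
        return None  # enough cycles in the loop to retire what was issued

    # Work in units of 1/drain_cycles of an entry to stay in integers.
    # While one iteration issues, only issue_cycles/drain_cycles of an
    # iteration's entries retire, so the surplus per iteration is
    # entries_per_iter * (drain_cycles - issue_cycles) / drain_cycles.
    surplus = entries_per_iter * (drain_cycles - issue_cycles)
    limit = capacity * drain_cycles

    occupancy = 0
    iteration = 0
    while occupancy < limit:
        iteration += 1
        occupancy += surplus
    return iteration
```

With made-up values such as iterations_until_full(6, 9, 8, 128), the
buffer fills after a bounded number of iterations, after which the loop
can only run at the drain rate. With drain_cycles equal to issue_cycles
the function returns None, which is the case where the loop has enough
cycles to retire everything it issues.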

I'm convinced now that mul_1a.asm is the only loop which will execute
optimally.