mul_1 2-way on T3
davem at davemloft.net
Wed Apr 3 02:24:51 CEST 2013
From: David Miller <davem at davemloft.net>
Date: Tue, 02 Apr 2013 20:04:19 -0400 (EDT)
> I'll keep playing with it and if I can get it to run consistently in
> 6 cycles per loop we should seriously consider taking this approach.
Actually, I think I've figured some of this out.
My variant can never be made to run in 6 cycles per loop.
Only loops like mul_1a.asm (and potentially mul_1b.asm) can, because
only they have enough cycles in the loop to retire multiplies without
a net build-up of pending multiplies in the OoO buffer.
mulx and umulxhi both take 12 cycles to complete, and there are
exactly 12 cycles between mulx issue and the reissue of the same
mulx in your loops.
So it looks like, for multiplies, we need at least 4-way unrolling to
issue optimally and there is no way around this.
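
To spell out the arithmetic, here is a rough sketch. It assumes the
target of 6 cycles per 2-way loop means 3 cycles per limb, and that a
given mulx slot cannot usefully reissue before its previous instance
has completed. The function names are made up for illustration; only
the 12-cycle latency comes from the numbers above.

```python
# Latency of mulx/umulxhi in cycles, as stated in the message.
MUL_LATENCY = 12

# Assumed target: 6 cycles per 2-way loop iteration = 3 cycles per limb.
TARGET_CYCLES_PER_LIMB = 3


def loop_cycles(unroll, cycles_per_limb=TARGET_CYCLES_PER_LIMB):
    """Cycles between issue and reissue of the same mulx slot."""
    return unroll * cycles_per_limb


def sustains_target(unroll, latency=MUL_LATENCY):
    """True if a multiply can retire before its slot reissues."""
    return loop_cycles(unroll) >= latency


def min_unroll(latency=MUL_LATENCY, cycles_per_limb=TARGET_CYCLES_PER_LIMB):
    """Smallest unroll factor whose loop body covers the latency."""
    return -(-latency // cycles_per_limb)  # integer ceiling division


if __name__ == "__main__":
    for ways in (2, 4):
        print(ways, loop_cycles(ways), sustains_target(ways))
    print("minimum unroll:", min_unroll())
```

Under those assumptions a 2-way loop has only 6 cycles between reissues
of the same mulx, half the latency, while a 4-way loop has exactly 12.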