mul_1 2-way on T3
davem at davemloft.net
Wed Apr 3 02:04:19 CEST 2013
So I've been toying with this loop:
1:	mulx u0, v0, %l2
	sub n, -(2 * 8), n
	umulxhi u0, v0, %l5
	ldx [n + u0_off], u0
	mulx u1, v0, %l3
	addxccc %l2, %o5, r0
	umulxhi u1, v0, %o5
	ldx [n + u1_off], u1
	addxccc %l3, %l5, r1
	stx r0, [n + r0_off]
	brlez n, 1b
	stx r1, [n + r1_off]
It's an attempt to do the rshift.asm trick for mul_1.asm.
Theoretically it should be possible to get the above to execute at 6
cycles per loop, since the load distances can be placed perfectly.
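For anyone following the carry chain, here is a rough C model of what the 2-way loop computes. It is only a sketch under my own assumptions, not the GMP code: the names mul_1_model, umulhi64 and addc are made up for illustration, the high-half multiply is built from 32-bit pieces as a portable stand-in for umulxhi, and the odd-limb tail is an addition the asm loop above does not show.

```c
#include <stdint.h>
#include <stddef.h>

/* High 64 bits of a 64x64-bit product, built from 32-bit halves.
   Hypothetical portable stand-in for the umulxhi instruction. */
static uint64_t umulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* a + b + *carry, updating *carry with the carry out.
   Models the add-with-carry role of addxccc. */
static uint64_t addc(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t s = a + b;
    unsigned c = s < a;
    uint64_t r = s + *carry;
    c |= r < s;
    *carry = c;
    return r;
}

/* rp[0..n-1] = low limbs of up[0..n-1] * v0; returns the top limb. */
uint64_t mul_1_model(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v0)
{
    uint64_t hi = 0;   /* high half pending from the previous product (%o5 role) */
    unsigned cy = 0;   /* carry flag kept live across iterations */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        uint64_t lo0 = up[i] * v0;              /* mulx    u0 */
        uint64_t hi0 = umulhi64(up[i], v0);     /* umulxhi u0 */
        uint64_t lo1 = up[i + 1] * v0;          /* mulx    u1 */
        uint64_t hi1 = umulhi64(up[i + 1], v0); /* umulxhi u1 */
        rp[i] = addc(lo0, hi, &cy);             /* r0 = lo0 + previous hi + C */
        rp[i + 1] = addc(lo1, hi0, &cy);        /* r1 = lo1 + hi0 + C */
        hi = hi1;
    }
    if (i < n) {                                /* odd limb count: one more limb */
        uint64_t lo = up[i] * v0;
        uint64_t hi_last = umulhi64(up[i], v0);
        rp[i] = addc(lo, hi, &cy);
        hi = hi_last;
    }
    return hi + cy;                             /* cannot overflow for a valid product */
}
```

The point of the model is that each result limb needs only one add-with-carry: the high half of one product is folded into the low half of the next, and the carry flag stays live from one iteration to the next.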
However, the result I get is that 2/3 of the time it executes in the
expected 6 cycles, but 1/3 of the time it executes in 7 cycles. I
must be bumping up against the OoO limits or something like that.
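To put that mix in cycles-per-limb terms, a small sketch (cycles_per_limb is a made-up helper name, and it assumes each loop iteration handles two limbs, as in the 2-way loop above):

```c
/* Average cycles per limb when a fraction frac_fast of the iterations take
   `fast` cycles and the remainder take `slow` cycles. */
double cycles_per_limb(double frac_fast, double fast, double slow,
                       double limbs_per_iter)
{
    double per_iter = frac_fast * fast + (1.0 - frac_fast) * slow;
    return per_iter / limbs_per_iter;
}
```

With the numbers above, cycles_per_limb(2.0 / 3.0, 6.0, 7.0, 2.0) works out to 19/6, roughly 3.17 cycles per limb, against 3.0 if every iteration took 6 cycles.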
This would still be faster than the current mul_1.asm code, but it does
not perform as well as your mul_1a.asm variant.
Anyways, just thought I'd pass this along. I'll keep playing with it
and if I can get it to run consistently in 6 cycles per loop we should
seriously consider taking this approach.