mul_1 2-way on T3

Wed Apr 3 02:04:19 CEST 2013

So I've been toying with this loop:

1:
	mulx	u0, v0, %l2
	sub	n, -(2 * 8), n

	umulxhi	u0, v0, %l5
	ldx	[n + u0_off], u0

	mulx	u1, v0, %l3
	addxccc	%l2, %o5, r0

	umulxhi	u1, v0, %o5
	ldx	[n + u1_off], u1

	addxccc	%l3, %l5, r1
	stx	r0, [n + r0_off]

	brlez	n, 1b
	 stx	r1, [n + r1_off]

It's an attempt to apply the rshift.asm trick to mul_1.asm.
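
For reference, here is a rough C model of the arithmetic the loop
performs. It is only a sketch: the names limb_t, mul_hi, adc and
mul_1_2way_model are made up for illustration, it assumes 64-bit limbs
and an even limb count, and it ignores the software pipelining of the
loads and stores as well as any setup and wind-down code around the
loop. The register mapping in the comments is my reading of the loop
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;	/* hypothetical name; 64-bit limbs assumed */

/* High 64 bits of a * b, i.e. the value umulxhi produces. */
static limb_t mul_hi(limb_t a, limb_t b)
{
	limb_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
	limb_t b_lo = b & 0xffffffffu, b_hi = b >> 32;
	limb_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
	limb_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
	limb_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);

	return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* a + b + *carry, updating *carry, like addxccc. */
static limb_t adc(limb_t a, limb_t b, unsigned *carry)
{
	limb_t t = a + b;
	limb_t s = t + *carry;

	*carry = (t < a) | (s < t);
	return s;
}

/* rp[0..n-1] = up[0..n-1] * v0; returns the carry-out limb. */
limb_t mul_1_2way_model(limb_t *rp, const limb_t *up, size_t n, limb_t v0)
{
	limb_t hi_prev = 0;	/* role of %o5 across iterations */
	unsigned carry = 0;	/* role of the carry flag */

	assert(n % 2 == 0);	/* two limbs per iteration */
	for (size_t i = 0; i < n; i += 2) {
		limb_t lo0 = up[i] * v0;		/* mulx    u0 -> %l2 */
		limb_t hi0 = mul_hi(up[i], v0);		/* umulxhi u0 -> %l5 */
		limb_t lo1 = up[i + 1] * v0;		/* mulx    u1 -> %l3 */
		limb_t hi1 = mul_hi(up[i + 1], v0);	/* umulxhi u1 -> %o5 */

		rp[i] = adc(lo0, hi_prev, &carry);	/* addxccc -> r0 */
		rp[i + 1] = adc(lo1, hi0, &carry);	/* addxccc -> r1 */
		hi_prev = hi1;
	}
	return hi_prev + carry;
}
```

The point of the model is just to show the dependency structure: each
low product is added to the high product of the previous limb, with the
carry chained through both adds and across iterations.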

Theoretically it should be possible to get the above to execute at 6
cycles per loop, since the load distances can be placed perfectly.

However, the result I get is that 2/3 of the time it executes in the
expected 6 cycles, but 1/3 of the time it takes 7 cycles, for an
average of about 6.33 cycles per loop.  I must be bumping up against
the OoO limits or something like that.

This would still be faster than the current mul_1.asm code, but it
doesn't perform as well as your mul_1a.asm variant.

Anyway, just thought I'd pass this along.  I'll keep playing with it,
and if I can get it to run consistently in 6 cycles per loop we should
seriously consider taking this approach.