[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 10:07:13 CET 2013

From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 09:10:30 +0100

> David Miller <davem at davemloft.net> writes:
> 
>> That's why realistically I'll probably only use mpmul for 3x3 and
>> larger.
> 
> So, e.g., an mpn_addmul_4 would make sense (and up to mpn_addmul_32, if
> you want to make maximal use of mpmul...)? I don't know anything about
> these sparc instructions beyond what you're explaining now, but it
> sounds like it could be advantageous to have one operand invariant in
> the loop.

I still think that using it in mpn_mul_basecase when N == M is going to
be the best use of this instruction.

The issue is that every time you want to use 'mpmul' you have to do
something like (this is for a 4x4 limb multiply):

	ldx	[MULTIPLIER + 0x00], %o0
	ldx	[MULTIPLIER + 0x08], %o1
	ldx	[MULTIPLIER + 0x10], %o2
	ldx	[MULTIPLIER + 0x18], %o3
	...
	save	%sp, -176, %sp
	ldx	[MULTIPLICAND + 0x00], %l0
	ldx	[MULTIPLICAND + 0x08], %l1
	ldx	[MULTIPLICAND + 0x10], %l2
	ldx	[MULTIPLICAND + 0x18], %l3
	...
	save	%sp, -176, %sp
	save	%sp, -176, %sp
	save	%sp, -176, %sp
	save	%sp, -176, %sp
	save	%sp, -176, %sp
	mpmul	3			! The immediate field is "N - 1"
	restore
	restore
	restore
	restore
	stx	%l0, [PRODUCT + 0x00]
	stx	%l1, [PRODUCT + 0x08]
	stx	%l2, [PRODUCT + 0x10]
	stx	%l3, [PRODUCT + 0x18]
	stx	%l4, [PRODUCT + 0x20]
	stx	%l5, [PRODUCT + 0x28]
	stx	%l6, [PRODUCT + 0x30]
	stx	%l7, [PRODUCT + 0x38]
	restore
	restore
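
For reference, here is a minimal Python sketch of the arithmetic that the 4x4 sequence above is meant to produce: four 64-bit limbs times four 64-bit limbs, giving eight result limbs. It models only the result, not the hardware. The names `mul_basecase` and `limbs_to_int` are hypothetical helpers for this illustration, and the least-significant-limb-first order (offset 0x00 is the low limb) is assumed from GMP's mpn convention; how 'mpmul' itself orders its register operands is not covered here.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def mul_basecase(up, vp):
    """Schoolbook product of two limb lists, least significant limb first.

    Hypothetical reference model: returns len(up) + len(vp) limbs.
    """
    rp = [0] * (len(up) + len(vp))
    for j, v in enumerate(vp):
        carry = 0
        for i, u in enumerate(up):
            # Accumulate one limb product plus the incoming carry.
            t = rp[i + j] + u * v + carry
            rp[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        # This slot has not been written by any earlier row.
        rp[j + len(up)] = carry
    return rp


def limbs_to_int(limbs):
    """Combine limbs (least significant first) into one Python integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


multiplier = [LIMB_MASK, 1, 2, 3]
multiplicand = [5, LIMB_MASK, 7, LIMB_MASK]
product = mul_basecase(multiplier, multiplicand)
```

With N == M == 4 the product occupies eight limbs, which matches the eight stx instructions in the sequence above.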

The circuit does scale very well; for example, here are the cycle counts
for just the 'mpmul' instruction itself, for N from 1 to 16:

 N   cycles
 1     79
 2     84
 3     90
 4     98
 5    108
 6    120
 7    134
 8    150
 9    168
10    188
11    210
12    234
13    260
14    288
15    318
16    350
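
To put those figures in perspective, the following Python sketch divides each count by the N*N limb products that an NxN multiply performs. The cycle counts are copied from the list above. The closed form `fitted_cycles` is only an observation about these sixteen numbers (it matches them for N >= 2), not a documented property of the hardware. None of this includes the load, store, save and restore overhead around the instruction.

```python
# Cycle counts for the 'mpmul' instruction alone, N = 1..16 (from the list above).
MPMUL_CYCLES = [79, 84, 90, 98, 108, 120, 134, 150,
                168, 188, 210, 234, 260, 288, 318, 350]


def fitted_cycles(n):
    """Empirical fit to the measured counts for n >= 2 (an observation only)."""
    return n * n + n + 78


def cycles_per_limb_product(n):
    """Measured cycles divided by the n*n limb products of an n x n multiply."""
    return MPMUL_CYCLES[n - 1] / (n * n)


for n, cycles in enumerate(MPMUL_CYCLES, start=1):
    print(f"N={n:2d}  cycles={cycles:3d}  "
          f"cycles/limb-product={cycles_per_limb_product(n):6.2f}")
```

The fixed cost of roughly 80 cycles dominates for small N, while the per-product cost keeps falling as N grows.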

Anyway, I obviously have a lot of experimenting and tinkering to do,
so I'll come back here once I have a better idea of how we might use
'mpmul' most effectively.

Thanks.