# [PATCH] Optimize 32-bit sparc T1 multiply routines.

David Miller davem at davemloft.net
Fri Jan 4 10:07:13 CET 2013

```
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 09:10:30 +0100

> David Miller <davem at davemloft.net> writes:
>
>> That's why realistically I'll probably only use mpmul for 3x3 and
>> larger.
>
> So, e.g., an mpn_addmul_4 would make sense (and up to mpn_addmul_32, if
> you want to make maximal use of mpmul...)? I don't know anything about
> these sparc instructions beyond what you're explaining now, but it
> sounds like it could be advantageous to have one operand invariant in
> the loop.

I still think using it in mpn_mul_basecase when N==M is going to be
the best usage of this instruction.

The issue is that every time you want to use 'mpmul' you have to do
something like (this is for a 4x4 limb multiply):

ldx	[MULTIPLIER + 0x00], %o0
ldx	[MULTIPLIER + 0x08], %o1
ldx	[MULTIPLIER + 0x10], %o2
ldx	[MULTIPLIER + 0x18], %o3
...
save	%sp, -176, %sp
ldx	[MULTIPLICAND + 0x00], %l0
ldx	[MULTIPLICAND + 0x08], %l1
ldx	[MULTIPLICAND + 0x10], %l2
ldx	[MULTIPLICAND + 0x18], %l3
...
save	%sp, -176, %sp
save	%sp, -176, %sp
save	%sp, -176, %sp
save	%sp, -176, %sp
save	%sp, -176, %sp
mpmul	3			! The immediate field is "N - 1"
restore
restore
restore
restore
stx	%l0, [PRODUCT + 0x00]
stx	%l1, [PRODUCT + 0x08]
stx	%l2, [PRODUCT + 0x10]
stx	%l3, [PRODUCT + 0x18]
stx	%l4, [PRODUCT + 0x20]
stx	%l5, [PRODUCT + 0x28]
stx	%l6, [PRODUCT + 0x30]
stx	%l7, [PRODUCT + 0x38]
restore
restore

The circuit does scale very well. For example, here are the cycle
counts for just the 'mpmul' instruction itself for N from 1 to 16:

 N  cycles
 1      79
 2      84
 3      90
 4      98
 5     108
 6     120
 7     134
 8     150
 9     168
10     188
11     210
12     234
13     260
14     288
15     318
16     350

Anyway, I obviously have a lot of experimenting and tinkering to do,
so I'll come back here once I have a better idea of how we might use
'mpmul' most effectively.

Thanks.
```
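
To make the scaling claim in the message concrete, here is a minimal Python sketch that divides each quoted cycle count by the N*N limb products of an N x N multiply. The sixteen figures are copied from the message and cover the 'mpmul' instruction only, not the surrounding loads, stores, `save` and `restore`. The helper names and the formula `N*N + N + 78` are assumptions made for illustration: the formula is an empirical observation that happens to match these figures for N = 2 to 16 (N = 1 is one cycle lower), not documented hardware behaviour.

```python
# Cycle counts for the 'mpmul' instruction alone, copied from the message
# (index 0 holds N = 1, index 15 holds N = 16).
CYCLES = [79, 84, 90, 98, 108, 120, 134, 150,
          168, 188, 210, 234, 260, 288, 318, 350]


def cycles_per_limb_product(n):
    """Quoted cycles divided by the n*n limb products of an n x n multiply."""
    return CYCLES[n - 1] / (n * n)


def quadratic_fit(n):
    """Hypothetical model: n*n + n + 78. An observation, not a hardware spec."""
    return n * n + n + 78


def fit_mismatches():
    """Return the values of N for which the model differs from the quoted count."""
    return [n for n in range(1, len(CYCLES) + 1)
            if quadratic_fit(n) != CYCLES[n - 1]]


if __name__ == "__main__":
    for n in range(1, len(CYCLES) + 1):
        print(f"N={n:2d}  cycles={CYCLES[n - 1]:3d}  "
              f"per limb product={cycles_per_limb_product(n):6.2f}  "
              f"model={quadratic_fit(n):3d}")
    print("model mismatches at N =", fit_mismatches())
```

Read this way, the figures suggest a roughly constant start-up cost that dominates small operands and is amortized as N grows, which fits the remark about reserving 'mpmul' for 3x3 and larger. Any real break-even point would also have to account for the register-window setup shown in the assembly listing, which this sketch ignores.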