[PATCH] Optimize 32-bit sparc T1 multiply routines.
David Miller
davem at davemloft.net
Fri Jan 4 10:07:13 CET 2013
From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 09:10:30 +0100
> David Miller <davem at davemloft.net> writes:
>
>> That's why realistically I'll probably only use mpmul for 3x3 and
>> larger.
>
> So, e.g., an mpn_addmul_4 would make sense (and up to mpn_addmul_32, if
> you want to make maximal use of mpmul...)? I don't know anything about
> these sparc instructions beyond what you're explaining now, but it
> sounds like it could be advantageous to have one operand invariant in
> the loop.
I still think using it in mpn_mul_basecase when N==M is going to be
the best usage of this instruction.
The issue is that every time you want to use 'mpmul' you have to do
something like this (for a 4x4 limb multiply):
        ldx     [MULTIPLIER + 0x00], %o0
        ldx     [MULTIPLIER + 0x08], %o1
        ldx     [MULTIPLIER + 0x10], %o2
        ldx     [MULTIPLIER + 0x18], %o3
        ...
        save    %sp, -176, %sp
        ldx     [MULTIPLICAND + 0x00], %l0
        ldx     [MULTIPLICAND + 0x08], %l1
        ldx     [MULTIPLICAND + 0x10], %l2
        ldx     [MULTIPLICAND + 0x18], %l3
        ...
        save    %sp, -176, %sp
        save    %sp, -176, %sp
        save    %sp, -176, %sp
        save    %sp, -176, %sp
        save    %sp, -176, %sp
        mpmul   3               ! The immediate field is "N - 1"
        restore
        restore
        restore
        restore
        stx     %l0, [PRODUCT + 0x00]
        stx     %l1, [PRODUCT + 0x08]
        stx     %l2, [PRODUCT + 0x10]
        stx     %l3, [PRODUCT + 0x18]
        stx     %l4, [PRODUCT + 0x20]
        stx     %l5, [PRODUCT + 0x28]
        stx     %l6, [PRODUCT + 0x30]
        stx     %l7, [PRODUCT + 0x38]
        restore
        restore
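
For checking results, here is a minimal reference model of what the
sequence above is expected to produce: an N x N limb product stored as
2N limbs. It is only a sketch in Python, not a description of how the
hardware works. The function names are made up for illustration, and it
assumes 64-bit limbs stored least significant limb first, as GMP does.

```python
# Reference model only: plain schoolbook multiplication of two
# equal-length limb vectors. Names here are hypothetical helpers,
# not part of GMP or of the SPARC instruction set.
LIMB_BITS = 64                      # assumption: 64-bit limbs (ldx/stx)
LIMB_MASK = (1 << LIMB_BITS) - 1


def mul_basecase_square(multiplier, multiplicand):
    """Return the 2N-limb product of two N-limb operands.

    Limbs are assumed to be ordered least significant first.
    """
    n = len(multiplier)
    assert len(multiplicand) == n, "this model covers the N == M case only"
    product = [0] * (2 * n)
    for i, a in enumerate(multiplier):
        carry = 0
        for j, b in enumerate(multiplicand):
            # a*b + existing limb + carry always fits in two limbs.
            t = a * b + product[i + j] + carry
            product[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        # product[i + n] has not been written by any earlier row.
        product[i + n] = carry
    return product


def limbs_to_int(limbs):
    """Convert a least-significant-first limb list to a Python int."""
    return sum(limb << (LIMB_BITS * k) for k, limb in enumerate(limbs))
```

For the 4x4 case, the eight limbs returned correspond to the eight
stores to PRODUCT + 0x00 through PRODUCT + 0x38.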
The circuit does scale very well. For example, here are cycle counts
for just the 'mpmul' instruction itself, for N from 1 to 16:
         N   cycles
         1     79
         2     84
         3     90
         4     98
         5    108
         6    120
         7    134
         8    150
         9    168
        10    188
        11    210
        12    234
        13    260
        14    288
        15    318
        16    350
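
To put the scaling in perspective, the short Python sketch below divides
each count by the N*N limb products the instruction performs. It uses
only the figures above. The helper names are made up for illustration,
and the closed form N*N + N + 78 is just an empirical fit to these
numbers (exact for N = 2 to 16, one cycle high for N = 1), not a
documented property of the hardware. It also ignores the load, save,
restore and store overhead around the instruction.

```python
# Cycle counts for 'mpmul' alone, copied from the figures above (N = 1..16).
MPMUL_CYCLES = [79, 84, 90, 98, 108, 120, 134, 150,
                168, 188, 210, 234, 260, 288, 318, 350]


def cycles_per_limb_product(n):
    """Cycles spent per 64x64 limb product for an n x n multiply."""
    return MPMUL_CYCLES[n - 1] / (n * n)


def quadratic_fit(n):
    """Empirical fit to the measured counts; an observation, not a spec."""
    return n * n + n + 78


for n in range(1, len(MPMUL_CYCLES) + 1):
    print(f"N={n:2d}  cycles={MPMUL_CYCLES[n - 1]:3d}  "
          f"per limb product={cycles_per_limb_product(n):6.2f}  "
          f"fit={quadratic_fit(n):3d}")
```

By this measure the cost drops from 79 cycles per limb product at N = 1
to under 1.4 at N = 16, which is why the fixed setup cost matters so
much for small N.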
Anyway, I obviously have a lot of experimenting and tinkering to do,
so I'll come back here once I have a better idea of how we might use
'mpmul' most effectively.
Thanks.