[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 09:00:53 CET 2013

From: nisse at lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 08:48:21 +0100

> David Miller <davem at davemloft.net> writes:
> 
>> Just FYI, I'm also working on an mpn_mul_basecase that makes use of
>> the T4 'mpmul' instruction which can do NxN 64-bit limb multiplies
>> for values of N from 1 to 32.
> 
> It might make sense to experiment with an mpn_addmul_2 before doing
> mpn_mul_basecase.

The crossover point at which mpmul becomes faster than a flat-out
mulx/umulxhi loop lies beyond 2x2 limbs, so I don't see any value in
looking into that just yet.
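
For reference, here is a minimal portable C sketch of what a flat-out
mulx/umulxhi loop computes. It is not the actual GMP assembly: the
names mul_lo, mul_hi and mul_basecase_sketch are made up for
illustration, and mul_hi leans on the unsigned __int128 extension of
GCC and Clang to stand in for umulxhi.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low 64 bits of a*b, the half that mulx produces. */
static uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of a*b, the half that umulxhi produces.
   Assumption: the compiler supports unsigned __int128. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical schoolbook multiply:
   rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1],
   least significant limb first; rp must not overlap the inputs. */
static void mul_basecase_sketch(uint64_t *rp,
                                const uint64_t *up, size_t un,
                                const uint64_t *vp, size_t vn)
{
    assert(un > 0 && vn > 0);

    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;

    for (size_t j = 0; j < vn; j++) {
        uint64_t carry = 0;
        for (size_t i = 0; i < un; i++) {
            uint64_t lo = mul_lo(up[i], vp[j]);
            uint64_t hi = mul_hi(up[i], vp[j]);

            lo += carry;              /* add carry from previous limb */
            hi += (lo < carry);       /* propagate overflow into hi */

            uint64_t sum = rp[i + j] + lo;
            hi += (sum < lo);         /* overflow from the accumulate */

            rp[i + j] = sum;
            carry = hi;               /* cannot overflow: a*b+c+r < 2^128 */
        }
        rp[j + un] = carry;           /* this limb is still zero here */
    }
}
```

For the 2x2 case this amounts to four mulx/umulxhi pairs plus carry
handling, with no setup beyond loading the operands.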

There's a lot of setup and teardown associated with using mpmul,
because it uses several register windows and some of the
floating-point registers to hold the entire set of inputs and to
provide the result.

That's why realistically I'll probably only use mpmul for 3x3 and
larger.
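
As a rough sketch of that size-based choice in C: the names below are
hypothetical, the lower bound of 3 is only the tentative cutoff
mentioned above, and falling back to the plain loop above 32 limbs is
an assumption made here because mpmul handles N only up to 32.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thresholds: 3 is the tentative cutoff from the text,
   32 is the largest N that mpmul accepts. */
enum { MPMUL_MIN_LIMBS = 3, MPMUL_MAX_LIMBS = 32 };

typedef enum { PATH_MULX_LOOP, PATH_MPMUL } mul_path;

/* Pick a code path for an NxN limb multiply. */
static mul_path choose_mul_path(size_t n)
{
    assert(n > 0);
    if (n >= MPMUL_MIN_LIMBS && n <= MPMUL_MAX_LIMBS)
        return PATH_MPMUL;      /* setup/teardown cost is amortized */
    return PATH_MULX_LOOP;      /* small sizes; large sizes by assumption */
}
```

The real cutoff would have to come from measurement on T4 hardware.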