[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 03:54:45 CET 2013

From: bodrato at mail.dm.unipi.it
Date: Fri, 4 Jan 2013 02:57:50 +0100 (CET)

> On Fri, 4 January 2013 1:49 am, David Miller wrote:
>> Just FYI, I'm also working on an mpn_mul_basecase that makes use of
>> the T4 'mpmul' instruction which can do NxN 64-bit limb multiplies
>> for values of N from 1 to 32.
> 
> Great! Maybe it can be useful also for mul_2 or higher.

Indeed.  One of the things I need to work on is determining where the
cut-off is for when 'mpmul' is actually faster than the usual
mulx/umulxhi implementation.

>> It's an instruction that seems like it was designed specifically for
>> libgmp :-)
> 
> If it supports only balanced multiplication (NxN and not NxM), its target
> probably is 2048-bit public-key crypto.

The chip has separate Montgomery multiply and Montgomery squaring
instructions for public-key crypto, and they are already in use in
the OpenSSL tree, for example.

Yes, the mpmul instruction is limited to balanced NxN multiplies.

Well, actually, we could use this mpmul instruction for NxM cases by
padding the unused parameters with zeros.  That way we could support
any case where N <= 32 and M <= 32.
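
A rough sketch of the padding idea, in portable C rather than T4
assembly.  The names balanced_mul_n, mul_padded, limb_t and MAX_LIMBS are
hypothetical, and balanced_mul_n is only a schoolbook stand-in for
whatever NxN primitive mpmul would provide.  It assumes a compiler with
unsigned __int128 (GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LIMBS 32            /* mpmul handles N from 1 to 32 */

typedef uint64_t limb_t;        /* one 64-bit limb */

/* Hypothetical stand-in for a balanced NxN multiply.
   Writes the 2*n limb product of {ap,n} and {bp,n} to rp. */
static void balanced_mul_n(limb_t *rp, const limb_t *ap,
                           const limb_t *bp, int n)
{
    assert(n >= 1 && n <= MAX_LIMBS);
    memset(rp, 0, 2 * (size_t)n * sizeof(limb_t));
    for (int i = 0; i < n; i++) {
        limb_t carry = 0;
        for (int j = 0; j < n; j++) {
            /* a*b + r + carry always fits in 128 bits */
            unsigned __int128 t =
                (unsigned __int128)ap[i] * bp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t)t;
            carry = (limb_t)(t >> 64);
        }
        rp[i + n] = carry;      /* not yet written by earlier rows */
    }
}

/* Unbalanced {ap,an} x {bp,bn}: zero-pad both operands to
   n = max(an, bn) limbs, do one balanced multiply, and keep the low
   an+bn limbs.  Returns 0 on success, -1 if a size is out of range. */
static int mul_padded(limb_t *rp, const limb_t *ap, int an,
                      const limb_t *bp, int bn)
{
    if (an < 1 || bn < 1 || an > MAX_LIMBS || bn > MAX_LIMBS)
        return -1;

    int n = an > bn ? an : bn;
    limb_t a[MAX_LIMBS] = {0};
    limb_t b[MAX_LIMBS] = {0};
    limb_t r[2 * MAX_LIMBS];

    memcpy(a, ap, (size_t)an * sizeof(limb_t));
    memcpy(b, bp, (size_t)bn * sizeof(limb_t));
    balanced_mul_n(r, a, b, n);

    /* Limbs above an+bn are zero because the padding was zero. */
    memcpy(rp, r, (size_t)(an + bn) * sizeof(limb_t));
    return 0;
}
```

The cost is that the padded multiply does max(N,M)^2 limb products
instead of N*M, so this presumably only pays off when the operands are
not too unbalanced.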

> Should we add a balanced-only mul_basecase_n function, to be used by
> mul_n, to fully exploit such an instruction? Modular arithmetic (crypto,
> ECM, etc.) can benefit from such an approach. How much faster than a
> fully-flexible mul_basecase would it be?

Making this for crypto would be of no value for T4, because as
mentioned the chip has other instructions that more directly support
modular arithmetic in the form of 'montmul' and 'montsqr'
instructions.