[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 21:46:53 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Fri, 04 Jan 2013 14:54:15 +0100

> (For modexp, I assume one can stay in registers, making this
> overhead small when using a large exponent, such as RSA
> signing/decryption.)

The montmul and montsqr instructions are meant to be used in
a sort of byte-code'ish way.

You have stubs that stay in the lowest register windows
executing montmul and montsqr instructions over and over
again, reusing the results of a previous computation as
inputs to the next computation.
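
To make the chaining concrete, here is a minimal software
sketch of a Montgomery modexp in which each montmul/montsqr
result feeds the next step without leaving Montgomery form.
It is not the hardware instruction semantics and not the
openssl code; the names montmul, mont_modexp and the "bits"
parameter are made up for illustration, and it assumes an
odd modulus n < 2**bits and Python 3.8+ (for pow(x, -1, m)).

```python
def montmul(a, b, n, bits, n_prime):
    """Montgomery product a*b*R^-1 mod n, with R = 2**bits."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> bits
    return u - n if u >= n else u       # single conditional subtract


def mont_modexp(base, exp, n, bits):
    """Left-to-right square-and-multiply, staying in Montgomery form."""
    r = 1 << bits
    n_prime = pow(-n, -1, r)            # -n^-1 mod R (n must be odd)
    r2 = (r * r) % n                    # used to convert into Montgomery form
    x = montmul(base % n, r2, n, bits, n_prime)
    acc = montmul(1, r2, n, bits, n_prime)
    for bit in bin(exp)[2:]:
        # "montsqr": the previous result is the only input
        acc = montmul(acc, acc, n, bits, n_prime)
        if bit == "1":
            # "montmul": previous result times the fixed base
            acc = montmul(acc, x, n, bits, n_prime)
    return montmul(acc, 1, n, bits, n_prime)  # convert back out
```

Only x and acc are live across the loop, which is the
software analogue of keeping operands in the register
windows between operations.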

That way you don't have to pull all the way out of the
register windows, and you only have to load in perhaps
one argument's worth of inputs.

Even a straightforward implementation that just does
full montmul/montsqr calls gets pretty good performance
for 1024-bit and larger keys, even better than discrete
code that uses pipelined loops of mulx/umulxhi.

All of this is in the current openssl CVS sources if you're
curious.