[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 21:42:31 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Fri, 04 Jan 2013 14:54:15 +0100

> Did you add umulxhi use in your patch from a few days ago?

Yes, I did use mulx/umulxhi (both T3 and T4 have umulxhi), and yes, the
multiplies do pipeline on T4 (they don't on T3).  mul_1 gets about 4
cycles per limb in a two-way unrolled loop.  addmul_1 gets about 6.5
cycles per limb.
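
For readers who have not looked at the patch, here is a rough C model of
what a two-way unrolled mul_1 does with the low and high product halves.
It is only a sketch, not the code from the patch: the names mul_1_sketch
and umulxhi_model are made up for illustration, the expression a * b on
uint64_t stands in for mulx (low 64 bits), and umulxhi_model stands in
for umulxhi (high 64 bits).  It says nothing about the real instruction
scheduling or the cycle counts quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for umulxhi: high 64 bits of the unsigned
   64x64-bit product, built from 32-bit halves so it is portable C. */
static uint64_t umulxhi_model(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    /* Sum of the three terms that land in bits 32..63; cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* rp[0..n-1] = up[0..n-1] * v; returns the most significant limb. */
uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    size_t i = 0;

    /* Two limbs per iteration: the four multiplies are independent of
       each other, only the carry chain links the two halves. */
    for (; i + 2 <= n; i += 2) {
        uint64_t lo0 = up[i] * v,     hi0 = umulxhi_model(up[i], v);
        uint64_t lo1 = up[i + 1] * v, hi1 = umulxhi_model(up[i + 1], v);

        uint64_t s0 = lo0 + cy;
        hi0 += (s0 < lo0);            /* hi0 <= 2^64-2, so no overflow */
        uint64_t s1 = lo1 + hi0;
        hi1 += (s1 < lo1);

        rp[i] = s0;
        rp[i + 1] = s1;
        cy = hi1;
    }

    /* Leftover limb when n is odd. */
    if (i < n) {
        uint64_t lo = up[i] * v, hi = umulxhi_model(up[i], v);
        uint64_t s = lo + cy;
        rp[i] = s;
        cy = hi + (s < lo);
    }
    return cy;
}
```

The point of the unrolling in this model is just that both pairs of
products can be started before either carry is resolved, which is where
pipelined multiplies would pay off.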

The lack of subxccc is not a real restriction for submul_1; it
actually needs only subcc (which all v9 chips have) and 'addxc'.  You
can take a look at my patch to see this.
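
To illustrate the idea without the assembly, here is a minimal C sketch
of a submul_1 loop that never subtracts with a borrow coming in.  It is
not the code from the patch.  The name submul_1_sketch is hypothetical,
and limbs are shrunk to 32 bits (an assumption made only so that plain
C can form the double-width product in a uint64_t).  The borrow produced
by each subtraction is folded into the carry limb with an addition,
which is the role a carry-in add such as addxc can play.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] -= up[0..n-1] * v; returns the borrow-out limb.
   Sketch only: 32-bit limbs, at least one limb assumed. */
uint32_t submul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t cy = 0;

    assert(n > 0);  /* assumption of this sketch */

    for (size_t i = 0; i < n; i++) {
        uint64_t prod = (uint64_t)up[i] * v;
        uint32_t lo = (uint32_t)prod;
        uint32_t hi = (uint32_t)(prod >> 32);

        /* Add the incoming carry limb to the low product half and
           push the carry-out into the high half (an add with carry). */
        lo += cy;
        cy = hi + (lo < cy);

        /* Plain subtract that only produces a borrow (like subcc);
           no borrow is consumed here. */
        uint32_t r = rp[i];
        uint32_t diff = r - lo;
        rp[i] = diff;

        /* Fold the borrow into the carry limb with an addition.
           This cannot overflow: if the high half reached 2^32-1 above,
           lo is 0 and the subtraction cannot borrow. */
        cy += (diff > r);
    }
    return cy;
}
```

For example, with rp = {10}, up = {3} and v = 3 this leaves rp = {1} and
returns 0.  Every borrow is turned into an increment of the next limb's
subtrahend, so the loop needs subtract-with-borrow-out and
add-with-carry-in but never subtract-with-borrow-in.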

Nevertheless the chip folks are fully aware of this lack of symmetry
and will add the subxc and subxccc instructions at some point in the
future.

I am well aware that the design they chose for these
mpmul/montmul/montsqr instructions has the downside that it allows no
overlap of the memory loads and stores with the computation.

I think that the people who designed these instructions were aware of
these issues as well, and it was simply a tradeoff.