GMP loop improvements

Torbjorn Granlund
03 Mar 2003 15:09:26 +0100

I plan to add more sophisticated assembly for 4.2, which should speed
basecase multiplication and REDC by a great deal.

Currently, typical RISC assembly routines have a structure like this:


This structure is particularly common for mpn_addmul_1, the most
critical routine in GMP.

The "Loop" block is unrolled and pipelined to run at great speed.  If
the Loop block has k consecutive iterations in some stage of
completion at any given time, we need k feed-in stages for gradually
starting the iterations, and similarly k wind-down stages for
finishing things up.

The various Feed-in and Wind-down blocks are "plucked" versions of the
Loop block.

This structure works wonderfully for large loop counts.  But feed-in
and wind-down together typically have a very high cost, adding an
overhead of tens of cycles.

(At the MUL_KARATSUBA_THRESHOLD, these routines run at less than half
the peak speed on some machines.  Unfortunately, a vast majority of
mpn_addmul_1 calls will be with counts < MUL_KARATSUBA_THRESHOLD.  For
really small operands, under 10 limbs, some machines' routines run at
less than 1/4 of the peak speed.)

Now, if we notice that feed-in phase p and wind-down phase p exactly
complement each other, together containing the same instructions as
the loop block (except for loop control instructions), an idea for
speeding things becomes apparent...

The idea is to allow two consecutive mpn_addmul_1 calls to overlap.
Wind-down phase p from mpn_addmul_1 call q would be scheduled with
feed-in phase p from mpn_addmul_1 call q+1.

This would work anytime mpn_addmul_1 is used repeatedly, such as in
schoolbook multiplication (mpn_mul_basecase) and in REDC (aka
Montgomery reduction, see mpz/powm.c).

Without much labour and minimal code expansion, we could transform the
various mpn_addmul_1 routines into mpn_mul_basecase routines and gain
a neat speedup of GMP.  For medium size operands on some machines, the
multiply speed would more than double.  For operands in the Karatsuba
range, the speedup would still be very significant.

With somewhat more labour and more code expansion, we could create
assembly REDC routines, mpn_redc_i1.  The code would be very similar
to mpn_mul_basecase.

If there are any assembly programmers out there, please jump in and
help!  This is needed for Alpha, HPPA, PowerPC, SPARC, and Itanium.
There is a lot of assembly programming needed for 4.2, if we're going
to do everything I'd like to do...