fast gcd

Wed Oct 14 11:22:30 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

>>   Some ideas:
>> 
>>   * Maintain operands left-normalized (most significant bit set).
>
> I fear this would lead to many unnecessary calls to lshift.

This comment was in the context of hgcd2, which takes two double-limb
words as imput, and returns a matrix with single-limb entries.
Improvements here affect the linear term in the running time for a
complete gcd or hgcd, for both Lehmer and the subquadratic algorithm.

> this is what I did with bgcd. I found experimentally the threshold to be about
> 32768 bits, i.e., 512 limbs.

To be compared with the HGCD_THRESHOLD, which can be a few times lower
than that. Also, below the HGCD_THRESHOLD, hgcd is computed
essentially using Lehmer's algorithm, and then the corresponding
remainders are always produced as a side product. So I'm afraid we're
going to need both interfaces and some thresholding.

>>   * Use FFT invariance for the matrix-vector multiplication
>>     mpn_hgcd_matrix_adjust. This function is used for varying
>>     balancedness of the inputs, depending also on the choice of p in
>>     the gcd and gcdext outer loops.
>
> you report a 15% speedup in your slides. Since most of the time is spent in the
> matrix-vector and matrix-matrix products asymptotically, and each operand is
> reused twice, we can expect a theoretical speedup of up to 50% if the pointwise
> products are negligible (i.e., with an FFT of large order).

This was with small prime fft, so the pointwise products are linear,
and hence negligible for large enough sizes.

There are three algorithms for matrix22_mul (the two first are
integrated into released GMP): Naive implementation with eight
multiplies, Strassen multiplication withh 7 multiplies, and
FFT-invariance version using 8 fourier transforms, 8 pointwise
multiplictions, 4 additions in the transform domain, and 4 inverse
transforms.

It's possible to combine Strassen-multiplication with FFT-invariance,
to get down to 7 pointwise multiplies, at the cost of a bunch of
additions and subtractions on transformed values. I don't think I
tried that, but I don't expect any significant gain since both
multiplication and additions are linear, and with a fairly small
difference in cost. The FFT interface I designed should support it.
The situation is different for FFT over a large coefficient ring, like
in current GMP.

> While I worked on bgcd, I also had some random ideas:
>
> * [bgcd-specific] since bgcd works with signed numbers, can we use a
>   2-complement representation, to avoid keeping separate signs with mpn?

Can you get by with only multiplications mod 2^n? Otherwise, wouldn't
you have to convert to sign/magnitude representation for each
multiplication? (I haven't thought about how two's complement interact
with FFT wraparound, but I wouldn't expect a nice match).

> * is it possible to compute only part of the matrix M (say 3 out of 4
>   coefficients) and still be able to reconstruct efficiently M^{-1} (a;b),
>   using the fact that M is unimodular?

Interesting idea. Have to consider how to do both matrix x matrix
products and matrix x vector products in that representation (the
inversion is still trivial).

When talking about alternative representations for the matrices, one
should also consider the representation proposed by Marco (A
Strassen-like Matrix Multiplication suited for squaring and highest
power computation). But when in the FFT-range and using invariance,
the gains from that would be small, at best, just as for classical
Strassen multiplication.

> * in the outer gcd routine, which reduces (a;b) to (a';b'), it is possible
>   to compute only say a' by the recursive call to hgcd, and reconstruct b'
>   afterwards?

That doesn't make much sense to me; I'd expect the effect on overall
performance of computing or omitting a reduced number in a hgcd call
should be the same for both numbers. Assuming of course that the
caller needs both numbers, but both the outer gcd loop and hgcd itself
do need them.

Regardss,
/Niels