fast gcd

Tue Oct 13 17:12:19 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

> (b) mpz_bgcd (or more exactly the inner hgcd routine) only computes the 2x2
>     matrix, not the corresponding terms of the remainder sequence, which are
>     recomputed if needed from that matrix
> (c) for the matrix-vector multiplication in (b), mpz_bgcd uses the wrap-around
>     trick together with FFT mod 2^N+1

Nice to get a confirmation that this organization gives an improvement
also in practice. My implementation of bgcd, which didn't do this, was
roughly 10% slower than the left-to-right version, iirc.

One question I've had is what's best for medium size numbers (when we
don't want to use quadratic gcd, but also are not in the range where
FFT is used for the multiplications in the subquadratic algorithm). Is
it good to use the (b) variant all the way, or should one return the
corresponding remainders when input size is below some threshold?

> Since (b) and (c) can be done in the classical case too, I believe the gcd code
> in GMP can still be improved.

The algorithm currently in gmp can be adapted to do that.

If we compute full remainders, like the current code, there's a
condition defining "correctness" of the returned values, and the stop
condition in hgcd results in a "maximal" matrix. Maximal means that
hgcd returns (a, b) and a corresponding matrix, such that returning
instead (min(a,b), |a-b|) and the corresponding matrix would violate
the correctness condition.

If we rearrange the algorithm to return the matrix only, we have to do
it in such a way that the returned matrix is always correct, but not
guaranteed to be maximal. This is on the GCD todo list, and should be
fairly straight-forward.

What's needed to gain from this is a primitive that computes a b - c d
when the result is known to be non-negative and of a given size, with
significant cancellation at the most significant words. This primitive
can then use mullo, or fft wraparound, or any other useful tricks
depending on the input size. Better algorithms for multiplication mod
2^n ± 1 for sizes below the fft range (as was discussed a few days
ago) would also help.

It's actually not so easy to do the same thing with classical
Schönhage gcd. The problem here is that it's difficult to guarantee
correctness without computing and checking the remainders.

One other questions: In the outer gcd loop, how large fraction do you
operate on with hgcd in each iteration? In current gmp, the fraction
for gcd is 1/3, which seemed close to optimal when I benchmarked it.

For gcdext, its a more difficult to get this right. The problem is
that during the loop, the remainders naturally shrink, but the
cofactors grow. To take cost of updating the cofactors into account,
the optimal fraction is most likely a function both on the current
remainder size and the current cofactor size. Current method for
selecting the cutting point for each outer iteration of gcdext can
surely be improved.

(If anybody wants to see the full gcd todo file, let me know).

Regards,
/Niels