fast gcd

Paul Zimmermann Paul.Zimmermann at
Wed Oct 14 09:43:40 CEST 2009


since you answered on gmp-devel, let me continue the discussion there.

> Next, I looked into small-prime fft (which I had been hacking on from
> time to time earlier). I implemented it for GMP, using Cooley-Tukey
> for cache locality, and van der Hoeven's truncated fft, together with
> varied number of primes (3--5), to reduce the staircase-shape of the
> running time vs size graph. I designed an invariance interface, and
> wrote an FFT-aware variant of matrix22_mul that used invariance. The
> new FFT code is not currently in GMP, but is available, for those with
> access, in the ~hg/gmp-proj/gcd-nisse repository.

quite interesting. I strongly advise that GMP provide an interface that
lets one perform FFT transforms and pointwise products separately. As you have
noticed, this is crucial for developing asymptotically fast algorithms.

> Slides from a presentation I made at the department at the end can
> also be found at

thank you for those slides, they are quite interesting!

>   Some ideas:
>   * Maintain operands left-normalized (most significant bit set).

I fear this would lead to many unnecessary calls to lshift. Or you would end up
designing a new mpn layer with MSB-aligned numbers (which would be quite
interesting for MPFR).

>   * Try the alternative interface
>       mp_size_t
>       mpn_hgcd (mp_srcptr ap, mp_srcptr bp, mp_size_t n,
> 		struct hgcd_matrix *M, mp_ptr tp);
>     where ap and bp are not modified, and only the matrix is
>     returned. Exploit cancellation in the product M^{-1} (a;b), using
>     FFT wrap-around. Can be expected to be useful only above some
>     threshold.

this is what I did with bgcd. I found experimentally that the threshold is about
32768 bits, i.e., 512 limbs on a 64-bit machine.

>   * Use FFT invariance for the matrix-vector multiplication
>     mpn_hgcd_matrix_adjust. This function is used for varying
>     balancedness of the inputs, depending also on the choice of p in
>     the gcd and gcdext outer loops.

you report a 15% speedup in your slides. Since asymptotically most of the time
is spent in the matrix-vector and matrix-matrix products, and each operand is
reused twice, we can expect a theoretical speedup of up to 50% if the pointwise
products are negligible (i.e., with an FFT of large order).

While I worked on bgcd, I also had some random ideas:

* [bgcd-specific] since bgcd works with signed numbers, can we use a
  two's-complement representation, to avoid keeping separate signs alongside
  the mpn operands?

* is it possible to compute only part of the matrix M (say 3 of the 4
  coefficients) and still reconstruct M^{-1} (a;b) efficiently,
  using the fact that M is unimodular?

* in the outer gcd routine, which reduces (a;b) to (a';b'), it may be possible
  to compute only, say, a' in the recursive call to hgcd, and reconstruct b'
  afterwards.

