fast gcd

Niels Möller nisse at lysator.liu.se
Tue Oct 13 21:58:38 CEST 2009


Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

>> (If anybody wants to see the full gcd todo file, let me know).
>
> I am interested.

Ok. First some context: Last autumn I spent two months at the computer
science department at KTH, sharing a room with Torbjörns and another
PhD student, and hacking on GMP. What I did during that time:
Implemented gcdext on top of the new hgcd algorithm (HGCD-D in the
Math. Comp paper). Integrated the new gcd and gcdext code in GMP
(replacing my previous and much more complicated Schönhage GCD which I
don't think made it into any GMP release).

Next, I looked into small-prime fft (which I had been hacking on from
time to time earlier). I implemented it for GMP, using Cooley-Tukey
for cache locality, and van der Hoeven's truncated fft, together with
varied number of primes (3--5), to reduce the staircase-shape of the
running time vs size graph. I designed an invariance interface, and
wrote an FFT-aware variant of matrix22_mul that used invariance. The
new FFT code is not currently in GMP, for those with access to
shell.gmplib.org its in the ~hg/gmp-proj/gcd-nisse respository.

Anyway, at the end of this period I wrote a TODO-GCD file, included
below. I think some minor items have been addressed since then, but
most of them remains.

Slides from a presentation a made at the department at the end can
also be found at
http://www.lysator.liu.se/~nisse/archive/gcd-seminar-2008-10-30.pdf

Regards,
/Niels

* Nail support (mainly affects hgcd2). The previous subquadratic code
  got it right.

* Optimization of hgcd2 and div2 (hgcd2.c).

  Some ideas:

  * Maintain operands left-normalized (most significant bit set).

  * Table lookups of either a single quotient, or an entire small
    matrix.

  * In hgcd2, use single precision operations after half of the work
    is done (depends on normalization).

  * Assembler implementation (after all other ideas are exhausted).

* Small-matrix application:

  * mpn_addaddmul_1msb0 for more architectures.

  * Write an mpn_addsubmul_1.

  * Write an hgcd4, to take advantage of mpn_*mul_2.

* gcdext:

  * In compute_v, use divexact. Or use mullo and bdiv_q do do the
    entire computation (a u - g) b^{-1} mod 2^{n * GMP_NUMB_BITS}.
    Export function, so it can be used also by mpz_gcdext.

  * Improve the choice of p, the splitting in the gcdext outer loop.

  * Test and document mpn_gcdext_1_u.

* hgcd:

  * Try the alternative interface

      mp_size_t
      mpn_hgcd (mp_srcptr ap, mp_srcptr bp, mp_size_t n,
		struct hgcd_matrix *M, mp_ptr tp);

    where ap and bp are not modified, and only the matrix is
    returned. Exploit cancellation in the product M^{-1} (a;b), using
    FFT wrap-around. Can be expected to be useful only above some
    threshold.
    
  * Use FFT invariance for the matrix-vector multiplication
    mpn_hgcd_matrix_adjust. This function is used for varying
    balancedness of the inputs, depending also on the choice of p in
    the gcd and gcdext outer loops.

* spfft

  * Derive better bounds for the transform size.

  * Test and improve mpn_sreconstruct.

  * Johan Håstad suggested to split inputs using signed numbers in the
    range (-m/2, m/2), deterministically or randomly. Reduces the
    size of the coefficients of the product by a bit or two in the
    worst case, and much more in the "average" case.

  * For Cooley-Tukey, consider higher levels of memory hierarchy, such
    as L2 cache and physical memory.

  * Further optimization of assembler loops, in particular
    fourier_inverse_base. Implement for other architectures.

  * Review the fourier interface, in particular with respect to
    wrap-around calculations, and negative polynomial coefficients
    (e.g., to compute a b - c d).

  * See if the truncated transform is useful as a complement to
    wraparound, when parts of the product is known a priori. Extend
    fourier_inverse to take known but non-zero inputs.

* Related general issues:

  * Unbalanced multiplication.

  * Mullo. Consider efficient computation of a b - c d, with
    cancellation, for sizes below the FFT range.


More information about the gmp-devel mailing list