fast gcd
Niels Möller
nisse at lysator.liu.se
Tue Oct 13 21:58:38 CEST 2009
Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:
>> (If anybody wants to see the full gcd todo file, let me know).
>
> I am interested.
Ok. First some context: Last autumn I spent two months at the computer
science department at KTH, sharing a room with Torbjörns and another
PhD student, and hacking on GMP. What I did during that time:
Implemented gcdext on top of the new hgcd algorithm (HGCD-D in the
Math. Comp paper). Integrated the new gcd and gcdext code in GMP
(replacing my previous and much more complicated Schönhage GCD which I
don't think made it into any GMP release).
Next, I looked into small-prime fft (which I had been hacking on from
time to time earlier). I implemented it for GMP, using Cooley-Tukey
for cache locality, and van der Hoeven's truncated fft, together with
varied number of primes (3--5), to reduce the staircase-shape of the
running time vs size graph. I designed an invariance interface, and
wrote an FFT-aware variant of matrix22_mul that used invariance. The
new FFT code is not currently in GMP, for those with access to
shell.gmplib.org its in the ~hg/gmp-proj/gcd-nisse respository.
Anyway, at the end of this period I wrote a TODO-GCD file, included
below. I think some minor items have been addressed since then, but
most of them remains.
Slides from a presentation a made at the department at the end can
also be found at
http://www.lysator.liu.se/~nisse/archive/gcd-seminar-2008-10-30.pdf
Regards,
/Niels
* Nail support (mainly affects hgcd2). The previous subquadratic code
got it right.
* Optimization of hgcd2 and div2 (hgcd2.c).
Some ideas:
* Maintain operands left-normalized (most significant bit set).
* Table lookups of either a single quotient, or an entire small
matrix.
* In hgcd2, use single precision operations after half of the work
is done (depends on normalization).
* Assembler implementation (after all other ideas are exhausted).
* Small-matrix application:
* mpn_addaddmul_1msb0 for more architectures.
* Write an mpn_addsubmul_1.
* Write an hgcd4, to take advantage of mpn_*mul_2.
* gcdext:
* In compute_v, use divexact. Or use mullo and bdiv_q do do the
entire computation (a u - g) b^{-1} mod 2^{n * GMP_NUMB_BITS}.
Export function, so it can be used also by mpz_gcdext.
* Improve the choice of p, the splitting in the gcdext outer loop.
* Test and document mpn_gcdext_1_u.
* hgcd:
* Try the alternative interface
mp_size_t
mpn_hgcd (mp_srcptr ap, mp_srcptr bp, mp_size_t n,
struct hgcd_matrix *M, mp_ptr tp);
where ap and bp are not modified, and only the matrix is
returned. Exploit cancellation in the product M^{-1} (a;b), using
FFT wrap-around. Can be expected to be useful only above some
threshold.
* Use FFT invariance for the matrix-vector multiplication
mpn_hgcd_matrix_adjust. This function is used for varying
balancedness of the inputs, depending also on the choice of p in
the gcd and gcdext outer loops.
* spfft
* Derive better bounds for the transform size.
* Test and improve mpn_sreconstruct.
* Johan Håstad suggested to split inputs using signed numbers in the
range (-m/2, m/2), deterministically or randomly. Reduces the
size of the coefficients of the product by a bit or two in the
worst case, and much more in the "average" case.
* For Cooley-Tukey, consider higher levels of memory hierarchy, such
as L2 cache and physical memory.
* Further optimization of assembler loops, in particular
fourier_inverse_base. Implement for other architectures.
* Review the fourier interface, in particular with respect to
wrap-around calculations, and negative polynomial coefficients
(e.g., to compute a b - c d).
* See if the truncated transform is useful as a complement to
wraparound, when parts of the product is known a priori. Extend
fourier_inverse to take known but non-zero inputs.
* Related general issues:
* Unbalanced multiplication.
* Mullo. Consider efficient computation of a b - c d, with
cancellation, for sizes below the FFT range.
More information about the gmp-devel
mailing list