Possible gcd/hgcd/gcdext improvements
Niels Möller
nisse at lysator.liu.se
Fri Aug 26 21:41:44 CEST 2011
I've ben thinking a bit about some possible improvements to
gcd/hgcd/gcdext. I don't think I will have time to finish any of it
anytime soon, so I'll try to write the ideas down.
1. New divide and conquer organization of gcdext. See message
<nnvcxhh4gq.fsf at stalhein.lysator.liu.se>,
http://gmplib.org/list-archives/gmp-devel/2011-May/001942.html.
Diversion: HGCD review
Let's first review what hgcd does. It takes two inputs A, B of size
n. Let s = floor(n/2) + 1. It computes a reduction matrix M
(non-negative integer values, invertible, determinant 1) and
corresponding reduced numbers a, b, such that
(a;b) = M^{-1} (A;B)
a,b are larger than s limbs.
|a-b| is s limbs or smaller.
Paul (and maybe others) have suggested, and it has been discussed
here earlier, using a variant of hgcd which returns the matrix only.
To make it possible to compute that with less effort, one needs to
relax the requirements a bit, and allow that |a-b| is slightly
larger than s limbs. "Slightly" might mean a limit at a fix number
of bits, or some O(log n)) bits would also be ok. Let's call such a
function hgcd_appr.
why is that useful? hgcd is often applied to the most significant
half (or third or some other fraction) of some larger inputs, and
then the matrix M is used to reduce the larger numbers. Say we have A
= alpha 2^k + A', B = beta 2^k + B', and apply hgcd to (alpha, beta)
producing a, b, M. Then we need to compute
(a;b) 2^k + M^{-1} (A'; B')
Without having (a, b) already available, we would instead need to
compute
M^{-1} (A; B)
This looks like it's much more expensive, since A, B are larger than
A', B'. But it doesn't have to be. The trick is that there's known
cancellation, so one can compute it efficiently mod 2^m or mod 2^m -
1 where m is a bound (known apriori, by examining the size of the
entries of M) on the size of the result.
2. I think such an hgcd_appr may be useful also in the Lehmer range.
Lehmer repeatedly chops off the two most significant limbs, applies
hgcd (using the specialized function hgcd2). Then the matrix from
hgcd2, with single-limb numbers, is used to reduce the full numbers,
and for building up a large matrix to be returned. I have been
thinking that with this algorithm, we get the a, b to return for
free.
I just realized that one can probably make a more efficient hgcd_appr
also in this case, by gradually dropping low limbs from the
computation. E.g., say hgcd inputs are of 20 limbs, hence the size of
both the reduced numbers and the matrix elements will be roughly 10
limbs. After a few iterations, we have reduced the inputs to, say, 12
limbs and we have 2 left to go. Then we shouldn't need more than about 5
limbs of precision, and updating the lowest 7 limbs seems mostly
useless.
In fact, hgcd2 already does something similar, switching from double
to single precision when the numbers have been reduced from
2*GMP_LIMB_BITS bits to 3/2 GMP_LIMB_MAX, which will sometimes
produce a matrix for which |a-b| is a bit or two too large (I don't
recall the details of that analysis).
3. Use of mulmid. Say we implement a recursive hgcd_appr as follows.
Inputs are A, B, of n limbs each.
1. M <-- hgcd_appr(top n/2 limbs of A, B)
2. (A; B) <-- M^{-1} (A;B)
3. Small number of Lehmer or Euclid steps, needed to guarantee a
size below 3n/4.
4. M' <-- hgcd(top n/2 limbs of A, B)
5. Return M * M'
Forget step 3 now (I hope it can be taken care of in some way, but
there may be a few different cases to think about). In step 2, I have
already mentioned cancellation. In the common case, the individual
multiplies making up the matrix-vector product are of size n/4 × n,
but the top n/2 limbs cancel out, and this lets us use modular
arithmetic where the modulo is any convenient number larger than
about 3 n/4.
But now look at step 4: The next thing we do is to discard the lowest
n/4 limbs. So we have full products of size 5 n/4, but we can ignore
the highest n/2 limbs as well as the lowest n/4, keeping only the n/2
limbs in the middle. I'm not sure ignoring the low limbs can be
exploited using modular arithmetic/wraparound arithmetic, but I think
an (unbalanced) n/4 x 3n/4 mulmid should almost do the job (and if it
turns out the mulmid boundaries are a limb or two off from what we
really need, that can hopefully be worked around). Which I imagine
will be particularly attractive in the Karatsuba range.
As always, comments and ideas appreciated.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list