Possible gcd/hgcd/gcdext improvements

Fri Aug 26 21:41:44 CEST 2011

I've ben thinking a bit about some possible improvements to
gcd/hgcd/gcdext. I don't think I will have time to finish any of it
anytime soon, so I'll try to write the ideas down.

1. New divide and conquer organization of gcdext. See message
   <nnvcxhh4gq.fsf at stalhein.lysator.liu.se>,
   http://gmplib.org/list-archives/gmp-devel/2011-May/001942.html.

Diversion: HGCD review

   Let's first review what hgcd does. It takes two inputs A, B of size
   n. Let s = floor(n/2) + 1. It computes a reduction matrix M
   (non-negative integer values, invertible, determinant 1) and
   corresponding reduced numbers a, b, such that

      (a;b) = M^{-1} (A;B)

      a,b are larger than s limbs.

      |a-b| is s limbs or smaller.

   Paul (and maybe others) have suggested, and it has been discussed
   here earlier, using a variant of hgcd which returns the matrix only.
   To make it possible to compute that with less effort, one needs to
   relax the requirements a bit, and allow that |a-b| is slightly
   larger than s limbs. "Slightly" might mean a limit at a fix number
   of bits, or some O(log n)) bits would also be ok. Let's call such a
   function hgcd_appr.

   why is that useful? hgcd is often applied to the most significant
   half (or third or some other fraction) of some larger inputs, and
   then the matrix M is used to reduce the larger numbers. Say we have A
   = alpha 2^k + A', B = beta 2^k + B', and apply hgcd to (alpha, beta)
   producing a, b, M. Then we need to compute

     (a;b) 2^k + M^{-1} (A'; B')

   Without having (a, b) already available, we would instead need to
   compute

     M^{-1} (A; B)

   This looks like it's much more expensive, since A, B are larger than
   A', B'. But it doesn't have to be. The trick is that there's known
   cancellation, so one can compute it efficiently mod 2^m or mod 2^m -
   1 where m is a bound (known apriori, by examining the size of the
   entries of M) on the size of the result.

2. I think such an hgcd_appr may be useful also in the Lehmer range.
   Lehmer repeatedly chops off the two most significant limbs, applies
   hgcd (using the specialized function hgcd2). Then the matrix from
   hgcd2, with single-limb numbers, is used to reduce the full numbers,
   and for building up a large matrix to be returned. I have been
   thinking that with this algorithm, we get the a, b to return for
   free.

   I just realized that one can probably make a more efficient hgcd_appr
   also in this case, by gradually dropping low limbs from the
   computation. E.g., say hgcd inputs are of 20 limbs, hence the size of
   both the reduced numbers and the matrix elements will be roughly 10
   limbs. After a few iterations, we have reduced the inputs to, say, 12
   limbs and we have 2 left to go. Then we shouldn't need more than about 5
   limbs of precision, and updating the lowest 7 limbs seems mostly
   useless.

   In fact, hgcd2 already does something similar, switching from double
   to single precision when the numbers have been reduced from
   2*GMP_LIMB_BITS bits to 3/2 GMP_LIMB_MAX, which will sometimes
   produce a matrix for which |a-b| is a bit or two too large (I don't
   recall the details of that analysis).

3. Use of mulmid. Say we implement a recursive hgcd_appr as follows.
   Inputs are A, B, of n limbs each.

     1. M <-- hgcd_appr(top n/2 limbs of A, B)

     2. (A; B) <-- M^{-1} (A;B)

     3. Small number of Lehmer or Euclid steps, needed to guarantee a
        size below 3n/4.

     4. M' <-- hgcd(top n/2 limbs of A, B)

     5. Return M * M'

   Forget step 3 now (I hope it can be taken care of in some way, but
   there may be a few different cases to think about). In step 2, I have
   already mentioned cancellation. In the common case, the individual
   multiplies making up the matrix-vector product are of size n/4 × n,
   but the top n/2 limbs cancel out, and this lets us use modular
   arithmetic where the modulo is any convenient number larger than
   about 3 n/4.

   But now look at step 4: The next thing we do is to discard the lowest
   n/4 limbs. So we have full products of size 5 n/4, but we can ignore
   the highest n/2 limbs as well as the lowest n/4, keeping only the n/2
   limbs in the middle. I'm not sure ignoring the low limbs can be
   exploited using modular arithmetic/wraparound arithmetic, but I think
   an (unbalanced) n/4 x 3n/4 mulmid should almost do the job (and if it
   turns out the mulmid boundaries are a limb or two off from what we
   really need, that can hopefully be worked around). Which I imagine
   will be particularly attractive in the Karatsuba range.

As always, comments and ideas appreciated.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.