gcdext_1 questions

Thu Dec 3 10:34:29 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

  I'm not sure what interface one should have for mpn_gcdext_1. There are
  two possible variants.

If we decide to make public a function "mpn_gcdext_1" then it should
probably take {p,n} and a limb as input arguments.  If it takes just two
limbs, the _1 scheme does not apply.

But that's not actually your question, I suppose.

  The first, let's call it mpn_gcdext_1_s, takes one bignum and one limb as
  inputs, and computes the gcd and the first cofactor.

    /* Returns G = gcd(U, V), and a cofactor S such that G = U S (mod
       V). It is required that U >= V > 0, and that the most significant
       limb of U is non-zero. The returned S satisfies the following
       requirements:

         S = 0 iff V divides U

         If G = V/2, then S = 1,
         otherwise, 2 |S| < V / G

       These bounds (see Thm. 4.3 in "A Computational Introduction to
       Number Theory and Algebra", Victor Shoup, book
       http://www.shoup.net/ntb/) imply that S always fits in a signed
       limb.
    */
    mp_limb_t
    mpn_gcdext_1_s (mp_limb_signed_t *sp,
                    mp_srcptr up, mp_size_t un, mp_limb_t v);

  The second variant, let's call it mpn_gcdext_1_st, takes two limbs as
  inputs and computes gcd and both cofactors,

    mp_limb_t
    mpn_gcdext_1_st (mp_limb_signed_t *sp, mp_limb_signed_t *tp,
                     mp_limb_t u, mp_limb_t v);

OK, so two things change, the input type of the u argument and whether
the 2nd cofactor is generated.

  Both can be implemented using Euclid's algorithm; then the only
  difference is that the _s variant has an initial mod_1 call, and it
  omits all updates of the second cofactor (which for large u won't fit in
  a single limb anyway).
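
To make the _s variant concrete, here is a minimal sketch of the scheme
described above: one initial mod_1-style reduction, then plain Euclid that
tracks only the first cofactor. It is not GMP code. The names limb_t,
limb_signed_t, mod_1 and gcdext_1_s are hypothetical stand-ins, and limbs
are assumed to be 32 bits so that 64-bit C arithmetic is enough for the
intermediate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;        /* hypothetical stand-in for mp_limb_t */
typedef int32_t  limb_signed_t; /* hypothetical stand-in for mp_limb_signed_t */

/* U mod v for U = {up, un}, least significant limb first.
   Plays the role of the initial mod_1 call. */
static limb_t mod_1(const limb_t *up, size_t un, limb_t v)
{
    uint64_t r = 0;
    for (size_t i = un; i-- > 0; )
        r = ((r << 32) | up[i]) % v;
    return (limb_t) r;
}

/* Returns g = gcd(U, v) and stores s with g == s*U (mod v).
   The second cofactor is never touched. */
static limb_t gcdext_1_s(limb_signed_t *sp,
                         const limb_t *up, size_t un, limb_t v)
{
    assert(v > 0);
    uint64_t a = v, b = mod_1(up, un, v);
    int64_t sa = 0, sb = 1; /* invariant: a == sa*U, b == sb*U (mod v) */

    while (b != 0) {
        uint64_t q = a / b, r = a % b;
        int64_t sr = sa - (int64_t) q * sb;
        a = b; sa = sb;
        b = r; sb = sr;
    }
    *sp = (limb_signed_t) sa; /* assumed to fit, per the bound quoted above */
    return (limb_t) a;
}
```

For U = 240 and v = 46 this yields g = 2 and s = -9, and indeed
-9 * 240 = -2160 = 2 - 47 * 46. When v divides U the loop body never runs
and s stays 0.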

But the 2nd cofactor arg could be made large, of course?  (I assume it
would be non-trivial to avoid quadratic complexity if the 2nd cofactor
were iteratively updated, therefore it will be computed at the end.)

  But it's also possible to implement the _st variant on top of the _s
  variant, by computing t as (g - s u) / v. Would that be faster or
  slower? Can one exploit that this is an exact division? When v is odd,
  the computation of v can be done as

's/computation of v/division by v/'?

    binvert_limb (vinv, v);
    *tp = (g - s * u) * vinv;

  (not 100% sure this gets the signs right; if not one would have to test
  for the sign of s and do two cases). For a 64-bit machine, that's 8
  multiplications, while updating the cofactor as a part of Euclid's
  algorithm needs one extra multiplication per iteration.

Of the 8 multiplications, 7 are dependent, and this will be done when
there is probably little other work for the processor.

The simultaneous cofactor updating will typically require more
multiplications (at least if the typical case is random full limbs, a
questionable assumption).  And while they are also "dependent", they are
spread out in time between other operations.
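
As an illustration of the exact-division route for odd v, here is a
self-contained sketch. It assumes 64-bit limbs, assumes the true t fits
in a signed limb, and uses a hypothetical helper binvert64 in place of
the binvert_limb macro. The helper has no lookup table, so it spends
more multiplications than the 8 counted above. Under those assumptions
the arithmetic is all modulo 2^64, which suggests that no case split on
the sign of s is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of odd v modulo 2^64 by Newton iteration.
   Hypothetical stand-in for binvert_limb. */
static uint64_t binvert64(uint64_t v)
{
    assert(v & 1);
    uint64_t x = v;             /* v*v == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 5; i++) /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
        x *= 2 - v * x;
    return x;
}

/* Given g and s with g == s*u + t*v over the integers, recover t.
   Requires odd v, and assumes t fits in int64_t. */
static int64_t cofactor_t(uint64_t g, int64_t s, uint64_t u, uint64_t v)
{
    uint64_t num = g - (uint64_t) s * u; /* (g - s*u) mod 2^64 */
    /* Exact division by v: multiply by the inverse. The cast to int64_t
       relies on the usual two's complement wrap-around. */
    return (int64_t) (num * binvert64(v));
}
```

With u = 7, v = 5, g = 1 and s = 3 this returns t = -4, matching
1 = 3*7 - 4*5. With u = 240, v = 49, g = 1 and s = -10 it returns
t = 49.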

  And in case v is even, one could either compute (g - s u) as a double
  word number, ...

It is not immediately evident that this is a double-word number.

  ... and then divide by v using a shift followed by
  multiplication by the inverse of the odd part of v. Or one could do some
  extra initial processing to take out common powers of two and if needed
  swap u,v and sp, tp.

And make sure such initial processing does not non-canonicalise the
resulting cofactors...

  Also for the Euclid code (no matter if it computes one or both
  cofactors), one should most likely use the div1 macro or similar to
  compute each quotient (which with high probability is small) using shift
  and subtraction. And then *maybe* one shouldn't multiply anything by q,
  but instead use shifts and adds as each bit of q is generated.

The divisions are the problem, let's not worry about multiplications by
q!  (For divisions, please use div1 for now.  At some point, we could
use a sliding window in u and v of, say, 5 and 3 bits respectively.
These bits are then used as an index into a division table with small
quotients.  Of course, the quotients can be too large (or if one so
chooses, too small), so some branch-free adjustments will be needed.
Instead of using a byte table, one could index with the u window bits
and extract a bit field with the v window bits.  That would allow for a
more compact table (512 bits for the 5,3 bit example), and might also be
faster.)
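
For reference, the shifts-and-adds idea from the quoted paragraph could
look roughly like the sketch below. It is a plain shift-and-subtract
loop, not GMP's div1, and the name div_step is hypothetical. Cofactors
are kept as unsigned magnitudes, on the assumption that the caller
tracks the alternating signs separately.

```c
#include <stdint.h>

/* One Euclid step without a division or multiplication instruction.
   With q = floor(*a / b), sets *a to *a mod b and adds q*sb to *sa
   (modulo 2^64). Requires b > 0. */
static void div_step(uint64_t *a, uint64_t b, uint64_t *sa, uint64_t sb)
{
    uint64_t n = *a, acc = *sa;
    int shift = 0;

    /* Find the largest shift with (b << shift) <= n, if b <= n. */
    while (shift < 63 && (b << shift) <= (n >> 1))
        shift++;

    /* Generate the bits of q from the top; each set bit triggers one
       subtraction for the remainder and one addition for the cofactor. */
    for (; shift >= 0; shift--) {
        if (n >= (b << shift)) {
            n -= b << shift;
            acc += sb << shift;
        }
    }
    *a = n;
    *sa = acc;
}
```

For a = 100, b = 7, sa = 3, sb = 5 the quotient is 14, so the step
leaves a = 2 and sa = 3 + 14*5 = 73. The loop count grows with the
bit size of q, so this only pays off when quotients are small.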

  But now we have left interface issues far behind, getting into micro
  optimizations...

Which is much more fun!  :-)

  And then we also have the question, if we can compute canonical
  cofactors using the binary algorithm, which would be particularly
  attractive if the binary algorithm can be expressed in a branch-free
  fashion, like the GCD_1_METHOD == 2 code in gcd_1.c.

Was the question "can the binary algorithm be coerced into computing the
canonical cofactors?"?

-- 
Torbjörn