# gcdext_1 questions

Torbjorn Granlund tg at gmplib.org
Thu Dec 3 10:34:29 CET 2009

```nisse at lysator.liu.se (Niels Möller) writes:

I'm not sure what interface one should have for mpn_gcdext_1. There are
two possible variants.

If we decide to make public a function "mpn_gcdext_1" than it should
probably take {p,n} and a limb as input arguments.  If it takes just two
limbs, the _1 scheme does not apply.

But that's not actually your question, I suppose.

The first, lets call it mpn_gcdext_1_s, takes one bignum and one limb as
inputs, and computes the gcd and the first cofactor.

/* Returns G = gcd(U, V), and a cofactor S such that G = U S (mod
V). It is required that U >= V > 0, and that the most significant
limb of U is non-zero. The returned S satisfies the following
requirements:

S = 0 iff V divides U

If G = V/2, then S = 1,
otherwise, 2 |S| < V / G

These bounds (see Them. 4.3 in "A Computational Introduction to
Number Theory and Algebra", Victor Shoup, book
http://www.shoup.net/ntb/) imply that S always fits in a signed
limb.
*/
mp_limb_t
mpn_gcdext_1_s (mp_limb_signed_t *sp,
mp_srcptr up, mp_size_t un, mp_limb_t v);

The second variant, lets call it mpn_gcdext_1_st, takes two limbs as
inputs and computes gcd and both cofactors,

mp_limb_t
mpn_gcdext_1_st (mp_limb_signed_t *sp, mp_limb_signed_t *tp,
mp_limb_t u, mp_limb_t u)

OK, so two things change, the input type of the u argument and whether
the 2nd cofactor is generated.

Both can be implemented using Euclid's algorithm; then the only
difference is that the _s variant has an initial mod_1 call, and it
omits all updates of the second cofactor (which for large u won't fit in
a single limb anyway).

But the 2nd cofactor arg could be made large, of course?  (I assume it
would be non-trivial to avoid quadratic complexity if the 2nd cofactor
where iteratively updated, therefore it will be computed at the end.)

But it's also possible to implement the _st variant on top of the _s
variant, by computing t as (g - s u) / v. Would that be faster or
slower? Can one exploit that this is an exact division? When v is odd,
the computation of v can be done as

's/computation of v/division by v/'?

binvert_limb (vinv, v);
*tp = (g - s * u) * vinv;

(not 100% sure this gets the signs right; if not one would have to test
for the sign of s and do two cases). For a 64-bit machine, that's 8
multiplications, while updating the cofactor as a part of Euclid's
algorithm needs one extra multiplication per iteration.

Of the 8 multiplications, 7 are dependent, and this will be done when
there is probably little other work for the processor.

The simultaneous cofactor updating will typically require more
multiplications (at least of the typical case is random full limbs, a
questionable assuption).  And while they are also "dependent", they are
spread out in time between other operations.

And in case v is even, one could either compute (g - s u) as a double
word number, ...

It is not immediately evident that this is a double-word number.

... and then divide by v using a shift followed by
multiplication by the inverse of the odd part of v. Or one could do some
extra initial processing to take out common powers of two and if needed
swap u,v and sp, tp.

And make sure such initial processing does not non-canonicalise the
resulting cofactors...

Also for the Euclid code (no matter if it computes one or both
cofactors), one should most likely use the div1 macro os similar to
compute each quotient (which with high probability is small) using shift
and subtraction. And then *maybe* one shouldn't multiply anything by q,
but instead use shifts and adds as each bit of q is generated.

The divisions are the problem, let's not worry about multiplications by
q!  (For divisions, please use div1 for now.  At some point, we could
use a sliding window in u and v of, say, 5 and 3 bits respectively.
These bits are then used as an index into a division table with small
quotients.  Of course, the quotients can be too large (or of one so
chooses, too small), so some branch-free adjustments will be needed.
Instead of using a byte table, one could index with the u window bits
and extract a bit field with the v window bits.  That would allow for a
more compact table (512 bits for the 5,3 but example), and might also be
faster.)

But now we have left interface issues far behind, geting into micro
optimizations...

Which is much more fun!  :-)

And then we also have the question, if we can compute canonical
cofactors using the binary algorithm, which would be particularly
attractive if the binary algorithm can be expressed in a branch-free
fashion, like the GCD_1_METHOD == 2 code in gcd_1.c.

Was the question "can the binary algorithm be coerced into computing the
canonical cofactors?"?

--
Torbjörn
```