# Small gcdext_1

Torbjörn Granlund tg at gmplib.org
Wed Oct 9 11:56:39 UTC 2019

```nisse at lysator.liu.se (Niels Möller) writes:

For small gcdext (one or two limbs), it seems likely that a branch-free
binary algorithm is a good choice. The binary algorithm in
mpn/generic/gcdext_1.c seems to work now, after a recent bugfix, but
it's not that fast. One reason is that the loop updates both cofactors,
which costs both instructions and registers.

If we do a gcdext computing only one cofactor, I'd like to think about
it as a modular inversion function. For u, v, with v odd, attempt to compute
u^{-1} (mod v). More precisely, always compute gcd(u,v). In addition,
compute a cofactor s according to these rules:

A modular inersion function also when the inverse does not exist.  :-)

If u = 0 (mod v), equivalently, gcd (u,v) = v, set s = 0

If gcd (u,v) = 1, set s = u^{-1} (mod v)

If gcd (u,v) > 1, set s = (u/g)^{-1} (mod v/g)

Below are three implementations of such a function, one based on
euclid's algorithm, one plain binary, and one binary using same masking
tricks as in gcdext_1.c.

You've been busy!

I believe two things will make the final, branch-reduced variant slow.
1. The explicit division, and 2. the shift loop.

It is not clear that postponing the work done in the shift loop from the
main loopis a good idea (even if it is done with a faster bmod op).  For
plain gcd_11 we typically are latency limited.  Perhaps some indepndent
work such as shifting s0 could be done for free in gcdext_11?

Using the main loop might also reduce register pressure.

PS. Why is the sign logic for the cofactor sign different in the two
binary method implementations?

--
Torbjörn