Small gcdext_1

Torbjörn Granlund tg at
Wed Oct 9 11:56:39 UTC 2019

nisse at (Niels Möller) writes:

  For small gcdext (one or two limbs), it seems likely that a branch-free
  binary algorithm is a good choice. The binary algorithm in
  mpn/generic/gcdext_1.c seems to work now, after a recent bugfix, but
  it's not that fast. One reason is that the loop updates both cofactors,
  which costs both instructions and registers.

  If we do a gcdext computing only one cofactor, I'd like to think about
  it as a modular inversion function. For u, v, with v odd, attempt to compute
  u^{-1} (mod v). More precisely, always compute gcd(u,v). In addition,
  compute a cofactor s according to these rules:

A modular inersion function also when the inverse does not exist.  :-)

  If u = 0 (mod v), equivalently, gcd (u,v) = v, set s = 0

  If gcd (u,v) = 1, set s = u^{-1} (mod v)

  If gcd (u,v) > 1, set s = (u/g)^{-1} (mod v/g)

  Below are three implementations of such a function, one based on
  euclid's algorithm, one plain binary, and one binary using same masking
  tricks as in gcdext_1.c.

You've been busy!

I believe two things will make the final, branch-reduced variant slow.
1. The explicit division, and 2. the shift loop.

It is not clear that postponing the work done in the shift loop from the
main loopis a good idea (even if it is done with a faster bmod op).  For
plain gcd_11 we typically are latency limited.  Perhaps some indepndent
work such as shifting s0 could be done for free in gcdext_11?

Using the main loop might also reduce register pressure.

PS. Why is the sign logic for the cofactor sign different in the two
binary method implementations?

Please encrypt, key id 0xC8601622

More information about the gmp-devel mailing list