Small gcdext_1

Torbjörn Granlund tg at gmplib.org
Wed Oct 9 11:56:39 UTC 2019


nisse at lysator.liu.se (Niels Möller) writes:

  For small gcdext (one or two limbs), it seems likely that a branch-free
  binary algorithm is a good choice. The binary algorithm in
  mpn/generic/gcdext_1.c seems to work now, after a recent bugfix, but
  it's not that fast. One reason is that the loop updates both cofactors,
  which costs both instructions and registers.

  If we do a gcdext computing only one cofactor, I'd like to think about
  it as a modular inversion function. For u, v, with v odd, attempt to compute
  u^{-1} (mod v). More precisely, always compute gcd(u,v). In addition,
  compute a cofactor s according to these rules:

A modular inersion function also when the inverse does not exist.  :-)

  If u = 0 (mod v), equivalently, gcd (u,v) = v, set s = 0

  If gcd (u,v) = 1, set s = u^{-1} (mod v)

  If gcd (u,v) > 1, set s = (u/g)^{-1} (mod v/g)

  Below are three implementations of such a function, one based on
  euclid's algorithm, one plain binary, and one binary using same masking
  tricks as in gcdext_1.c.

You've been busy!

I believe two things will make the final, branch-reduced variant slow.
1. The explicit division, and 2. the shift loop.

It is not clear that postponing the work done in the shift loop from the
main loopis a good idea (even if it is done with a faster bmod op).  For
plain gcd_11 we typically are latency limited.  Perhaps some indepndent
work such as shifting s0 could be done for free in gcdext_11?

Using the main loop might also reduce register pressure.

PS. Why is the sign logic for the cofactor sign different in the two
binary method implementations?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


More information about the gmp-devel mailing list