Small gcdext_1
Torbjörn Granlund
tg at gmplib.org
Wed Oct 9 11:56:39 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
For small gcdext (one or two limbs), it seems likely that a branch-free
binary algorithm is a good choice. The binary algorithm in
mpn/generic/gcdext_1.c seems to work now, after a recent bugfix, but
it's not that fast. One reason is that the loop updates both cofactors,
which costs both instructions and registers.
If we do a gcdext computing only one cofactor, I'd like to think about
it as a modular inversion function. For u, v, with v odd, attempt to compute
u^{-1} (mod v). More precisely, always compute gcd(u,v). In addition,
compute a cofactor s according to these rules:
A modular inersion function also when the inverse does not exist. :-)
If u = 0 (mod v), equivalently, gcd (u,v) = v, set s = 0
If gcd (u,v) = 1, set s = u^{-1} (mod v)
If gcd (u,v) > 1, set s = (u/g)^{-1} (mod v/g)
Below are three implementations of such a function, one based on
euclid's algorithm, one plain binary, and one binary using same masking
tricks as in gcdext_1.c.
You've been busy!
I believe two things will make the final, branch-reduced variant slow.
1. The explicit division, and 2. the shift loop.
It is not clear that postponing the work done in the shift loop from the
main loopis a good idea (even if it is done with a faster bmod op). For
plain gcd_11 we typically are latency limited. Perhaps some indepndent
work such as shifting s0 could be done for free in gcdext_11?
Using the main loop might also reduce register pressure.
PS. Why is the sign logic for the cofactor sign different in the two
binary method implementations?
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list