nisse at lysator.liu.se
Wed Oct 9 12:31:56 UTC 2019
tg at gmplib.org (Torbjörn Granlund) writes:
> I believe two things will make the final, branch-reduced variant slow.
> 1. The explicit division, and 2. the shift loop.
It turns out the explicit division is easily eliminated, v_div_g = s0 +
s1 at the end of the binary loop (when u == v). And in the euclid loop,
u == 0 implies v_div_g = s0.
> It is not clear that postponing the work done in the shift loop from the
> main loopis a good idea (even if it is done with a faster bmod op). For
> plain gcd_11 we typically are latency limited. Perhaps some indepndent
> work such as shifting s0 could be done for free in gcdext_11?
Maybe. I see two complications:
* The shift step needs to add v_div_g (when s0 is odd), and that's not
known until end of the main loop. One could add the current v instead,
but I'm afraid one might end up with a too large s0. Since the
function aims to produce a unique s in the range 0 <= s < input_v / g.
* A single iteration of the main loop may need to shift more than one
place. To avoid a nested loop, one would need a binv reduction in the
loop. One could do it with only a small inverse, with a different code
path for the unlikely case of a large shift.
> PS. Why is the sign logic for the cofactor sign different in the two
> binary method implementations?
Not sure exactly what difference you're referring to. But one source of
difference is that the second variant swaps u and v, and each swap flips
the the sign of the determinant of the corresponding matrix. While the
first variant never swaps anything.
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel