Small gcdext_1

Wed Oct 9 12:31:56 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> I believe two things will make the final, branch-reduced variant slow.
> 1. The explicit division, and 2. the shift loop.

It turns out the explicit division is easily eliminated, v_div_g = s0 +
s1 at the end of the binary loop (when u == v). And in the euclid loop,
u == 0 implies v_div_g = s0.

> It is not clear that postponing the work done in the shift loop from the
> main loopis a good idea (even if it is done with a faster bmod op).  For
> plain gcd_11 we typically are latency limited.  Perhaps some indepndent
> work such as shifting s0 could be done for free in gcdext_11?

Maybe. I see two complications:

* The shift step needs to add v_div_g (when s0 is odd), and that's not
  known until end of the main loop. One could add the current v instead,
  but I'm afraid one might end up with a too large s0. Since the
  function aims to produce a unique s in the range 0 <= s < input_v / g.

* A single iteration of the main loop may need to shift more than one
  place. To avoid a nested loop, one would need a binv reduction in the
  loop. One could do it with only a small inverse, with a different code
  path for the unlikely case of a large shift.

> PS. Why is the sign logic for the cofactor sign different in the two
> binary method implementations?

Not sure exactly what difference you're referring to. But one source of
difference is that the second variant swaps u and v, and each swap flips
the the sign of the determinant of the corresponding matrix. While the
first variant never swaps anything.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.