Small operands gcd improvements

Mon Aug 5 12:17:14 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  > Checking u1 = 0 and v1 = 0 separately as you suggest is a different
  > thing, and it might not have zero cost in the gcd_22 loop.

  I think only the shifted number should be checked, and as the main loop
  exit condition.

OK, so we agree there!

  If both u1 > 0 and v1 > 0 on entry (whether gcd_22 requires that is
  unclear), then u1 == 0 implies v1 > 0.

I don't think gcd_22 should misbehave for v1 = u1 = 0.  It currently is
designed to exit for such operands in order to leave things for gcd_11.

But wait, I think my current code might fail for u1 = 0, u0 = small, v1
= v0 = 0xfff.fff.  That would result in the upper half difference being
0 with carry out.  The zero will trigger loop exit currently.  Fixable,
but we should make sure to test for that (at least if u1 = 0 is
allowed).
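
To make the corner case concrete, here is a minimal C sketch of a
two-limb subtraction with borrow.  It is not the actual gcd_22 code; the
names diff2 and sub_2limb are made up for illustration, and a limb is
assumed to be 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper type: two-limb difference plus the borrow out. */
typedef struct {
    uint64_t hi;      /* upper limb of (u - v) mod 2^128 */
    uint64_t lo;      /* lower limb of (u - v) mod 2^128 */
    unsigned borrow;  /* 1 if u < v, i.e. the subtraction wrapped */
} diff2;

/* Compute {u1,u0} - {v1,v0} limb by limb, propagating the borrow. */
static diff2 sub_2limb(uint64_t u1, uint64_t u0, uint64_t v1, uint64_t v0)
{
    diff2 r;
    uint64_t b0 = u0 < v0;   /* borrow out of the low limb */
    uint64_t t  = u1 - v1;   /* high limb before applying b0 */
    uint64_t b1 = u1 < v1;   /* borrow from the high subtraction itself */

    r.lo = u0 - v0;
    r.hi = t - b0;
    /* The high limb borrows if either step wrapped. */
    r.borrow = (unsigned) (b1 | (t < b0));
    return r;
}
```

With u1 = 0, u0 = 5 and v1 = v0 = all ones, the high limb of the
difference comes out as 0 while the borrow is 1.  A loop exit that tests
only "high limb == 0" would take that as a signal to stop, even though
the true difference is negative and the operands are nowhere near
reduced.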

  > We could do a large rightshift outside the loop and then jump back into
  > and (ab)use gcd_22 with u1 = 0 xor v1 = 0.  I suspect random operands
  > will not see any timing difference.  Don't you agree that u1 / v1 will
  > not be too far from 1.0 on average?

  That would probably work fine, but maybe not simpler than an explicit
  gcd_22. So then the normal case would call gcd_22 with u1 > 0, v1 > 0,
  and exit the loop with u1 > 0, v1 == 0. I don't see why we'd need a
  right shift here, I think one could just check if u1 > 0 and if so jump
  back to gcd_22, and next time the loop exits we will have u1 = 0, v1 =
  0.

What do you mean by an "explicit gcd_22"?

What I am saying is that we should not design and validate more loops
than justified.  While gcd_21 would be sleeker than gcd_22, it might not
speed up non-contrived examples, since once gcd_22 exits because it has
to (the case v0 = 0 or u0 = 0) we are in an unlikely scenario.  If we
let it exit if v1 = 0 *or* u1 = 0, as opposed to v1 = 0 and u1 = 0, then
we might need gcd_21 in some form, but it will rarely run for more than
one or two iterations (not enough to train its loop branch to be
correctly predicted).
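
A rough C model of the two exit conditions, for experimenting with how
many iterations a gcd_21-style tail would actually get.  This is a
sketch only, not the real code: a limb is assumed to be 32 bits so that
a two-limb operand fits in a uint64_t, both operands are assumed odd,
and the names gcd_11_model and gcd_22_model are invented here.

```c
#include <stdint.h>

/* Single-limb binary gcd model; both operands must be odd. */
static uint32_t gcd_11_model(uint32_t u, uint32_t v)
{
    while (u != v) {
        /* The difference of two odd numbers is even and nonzero, so
           shift at least once and continue until it is odd again. */
        if (u > v) { u -= v; do u >>= 1; while ((u & 1) == 0); }
        else       { v -= u; do v >>= 1; while ((v & 1) == 0); }
    }
    return u;
}

/* Two-limb model using the "and" exit: loop until both high limbs are
   zero, then hand over to gcd_11_model.  *tail_iters receives the
   number of iterations run with exactly one high limb already zero,
   i.e. the work an "or" exit would leave for a gcd_21-style loop. */
static uint64_t gcd_22_model(uint64_t u, uint64_t v, unsigned *tail_iters)
{
    unsigned tail = 0;

    while (((u | v) >> 32) != 0) {
        if (u == v)
            break;                      /* gcd found while still two limbs */
        if ((u >> 32) == 0 || (v >> 32) == 0)
            tail++;                     /* one high limb is already zero */
        if (u > v) { u -= v; do u >>= 1; while ((u & 1) == 0); }
        else       { v -= u; do v >>= 1; while ((v & 1) == 0); }
    }
    *tail_iters = tail;
    if (((u | v) >> 32) != 0)
        return u;                       /* left the loop via u == v */
    return gcd_11_model((uint32_t) u, (uint32_t) v);
}
```

Feeding this random odd operands and looking at the distribution of
tail_iters would be one way to check the "one or two iterations"
expectation before committing to another asm loop.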

  I think an explicit gcd_21 loop may be clearer, but unlikely to matter
  for performance.

I'd like to avoid another asm loop if possible.

-- 
Torbjörn
Please encrypt, key id 0xC8601622