# Small operands gcd improvements

Torbjörn Granlund tg at gmplib.org
Mon Aug 5 12:17:14 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

> Checking u1 = 0 and v1 = 0 separately as you suggest is a different
> thing, and it might not have zero cost in the gcd_22 loop.

I think only the shifted number should be checked, and as the main loop
exit condition.

OK, so we agree there!

If both u1 > 0 and v1 > 0 on entry (whether gcd_22 requires that is
unclear), then u1 == 0 implies v1 > 0.

I don't think gcd_22 should misbehave for v1 = u1 = 0.  It currently is
designed to exit for such operands in order to leave things for gcd_11.

But wait, I think my current code might fail for u1 = 0, u0 = small, v1
= v0 = 0xfff...fff.  That would result in the upper half difference being
0 with carry out.  The zero will trigger loop exit currently.  Fixable,
but we should make sure to test for that (at least if u1 = 0 is
allowed).
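
To make that case concrete, here is a minimal C sketch of the two-limb
subtraction involved.  It is not the GMP code: sub_22 is a hypothetical
helper, and 64-bit limbs are assumed.  It only shows that the high half of
the difference comes out as 0 while a borrow is carried out.

```c
#include <stdint.h>

/* Hypothetical helper (not from GMP): compute (u1,u0) - (v1,v0) on
   64-bit limbs, store the two-limb result in (*r1,*r0), and return
   the borrow out of the high limb (0 or 1). */
static int
sub_22 (uint64_t *r1, uint64_t *r0,
        uint64_t u1, uint64_t u0, uint64_t v1, uint64_t v0)
{
  uint64_t lo = u0 - v0;          /* low limb, wraps mod 2^64 */
  uint64_t b0 = u0 < v0;          /* borrow out of the low limb */
  uint64_t hi = u1 - v1;          /* high limb before applying b0 */
  uint64_t b1 = u1 < v1;          /* borrow from the high subtraction */

  b1 |= hi < b0;                  /* borrow caused by subtracting b0 */
  hi -= b0;

  *r0 = lo;
  *r1 = hi;
  return (int) b1;
}
```

With u1 = 0, u0 = 5, v1 = v0 = UINT64_MAX, this gives a high limb of 0, a
low limb of 6, and a borrow of 1, so a test on the high limb alone cannot
tell this apart from a genuinely small difference.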

> We could do a large rightshift outside the loop and then jump back into
> and (ab)use gcd_22 with u1 = 0 xor v1 = 0.  I suspect random operands
> will not see any timing difference.  Don't you agree that u1 / v1 will
> not be too far from 1.0 on average?

That would probably work fine, but maybe not simpler than an explicit
gcd_22. So then the normal case would call gcd_22 with u1 > 0, v1 > 0,
and exit the loop with u1 > 0, v1 == 0. I don't see why we'd need a
right shift here, I think one could just check if u1 > 0 and if so jump
back to gcd_22, and next time the loop exits we will have u1 = 0, v1 =
0.

What do you mean by an "explicit gcd_22"?

What I am saying is that we should not design and validate more loops
than justified.  While gcd_21 would be sleeker than gcd_22, it might not
speed up non-contrived examples, since after gcd_22 exits because it has to
(the case v0 = 0 or u0 = 0) we have an unlikely scenario.  If we let it
exit if v1 = 0 *or* u1 = 0, as opposed to v1 = 0 and u1 = 0, then we
might need gcd_21 in some form, but it will rarely run for more than one
or two iterations (not enough to train its loop branch to be rightly
predicted).

I think an explicit gcd_21 loop may be clearer, but unlikely to matter
for performance.

I'd like to avoid another asm loop if possible.

--
Torbjörn