# Small operands gcd improvements

Torbjörn Granlund tg at gmplib.org
Wed Jul 31 10:58:26 UTC 2019

```nisse at lysator.liu.se (Niels Möller) writes:

We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are
non-zero. After computing {t1, t0} <-- {v1,v0} - {u1,u0}, there are a
couple of cases:

My main perspective wasn't algorithmic but about code organisation.
"Were to do what."

Let's compare what you suggest algorithm wise compared to what I
implemented.

My code stops if either u0-v0 or u1-v1-cy becomes zero (where cy is
assigned from the first subtraction).

If u0-v0 becomes zero, either u0 or v0 will become zero in the next
iteration, of course.  My code then bails out as it cannot handle a
rightshift of a full limb.  The bailout state is essentially a gcd_21
situation.  So your suggested criteria and my implemented criteria seem
very similar here.

Checking u1 = 0 and v1 = 0 separately as you suggest is a different
thing, and it might not have zero cost in the gcd_22 loop.  If many
iterations are expected before one has u1 = v1 = 0, then this is worth a
separate loop.  (Or perhaps a hardware 2/1 division, but its quotient
does not necessarily fit in a single limb!)

The gcd_21 loop takes inputs (u1, u0, v), inputs odd, and all the limbs
non-zero. It is a much simpler loop than gcd_22

We could do a large rightshift outside the loop and then jump back into
and (ab)use gcd_22 with u1 = 0 xor v1 = 0.  I suspect random operands
will not see any timing difference.  Don't you agree that u1 / v1 will
not be too far from 1.0 on average?

Let's say that gcd(u,v) requires an odd u. What are the options for v?
If we allow arbitrary v, then it could in principle be implemented as
the tail recursive

mp_limb_t
gcd_11 (mp_limb_t u, mp_limb_t v)
{
if (v == 0) return u;
v >>= count;
return gcd_11(min(u,v), |u-v|);
}

Cute.  But I think we should reuse the fine tuned loops of our many
gcd_1.asm.  Perhaps we could arrange to enter these loops at a point
where we allow for one operand being even.  But that would cause a very
slight delay in getting things going in the common case of two odd
operands (as you mention).

So I see no strong reason to do it one way or the other. Maybe start
with requiring both u and v odd, and relax requirements later in case we
find any more tangible benefit from doing that?

Makes sense.

--
Torbjörn