Small operands gcd improvements

Wed Jul 31 07:14:15 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> Perhaps gcd_22 should (conditionally) invoke (or tail call) gcd_11,
> possibly it should let its caller invoke gcd_11 to finish up an
> unfinished job.

We should run the gcd_22 main loop as long as all of u1, u0, v1, v0 are
non-zero. After computing {t1, t0} <-- {v1,v0} - {u1,u0}, there are a
couple of cases:

1. t1 == t0 == 0: Terminate. Will happen if and only of the gcd is larger
   than a single limb.

2. t0 == 0. Enter gcd_21 (se below), after some setup to shift out
   trailing zeros from |v1 - u1|.

3. Common case: Get absolute value and shift out trailing zeros. May
   produce a zero high limb, in that case fall through to gcd_21,
   otherwise, continue gcd_22 loop.

(We could check for an early zero high limb between steps (2) and (3),
i.e., check for (t1 == 0 and no underflow or t1 == ~0 and underflow from
the subtraction). But most likely not worth the effort, since these
cases are also taken care of by the high limb check in (3).)

The gcd_21 loop takes inputs (u1, u0, v), inputs odd, and all the limbs
non-zero. It is a much simpler loop than gcd_22

  do
    {
      {u1, u0} <-- {u1, u0} - v   (always > 0)
      if (u0 == 0) { u0 = u1; break; }
      shift out trailing zeros
    }
  while (u1 > 0)

Not sure how many rounds it will typically run, but I'd guess typically
few. When the gcd_21 loop exits, we're set for a call to gcd_11.

>   What should its input requirements be? Should it require one argument,
>   either argument, or both argument to be odd? If even inputs are allowed,
>   do we also allow zero input?
>
>   To me it makes some sense to require that common factors of two are
>   taken care of at top-level, and at least rule out the case of two even
>   inputs to gcd_11.
>
> Agreed.

Let's say that gcd(u,v) requires an odd u. What are the options for v?
If we allow arbitrary v, then it could in principle be implemented as
the tail recursive

mp_limb_t 
gcd_11 (mp_limb_t u, mp_limb_t v) 
{
  if (v == 0) return u;
  count_leading_zeros(cnt, v);
  v >>= count;
  return gcd_11(min(u,v), |u-v|);
}

Or we say odd u, non-zero v, the check for zero is moved around, and
we'll check for zero only in connection with the u-v subtraction,

mp_limb_t 
gcd_11 (mp_limb_t u, mp_limb_t v) 
{
  count_leading_zeros(cnt, v);
  v >>= count;
  if (u == v) return u;
  return gcd_11(min(u,v), |u-v|);
}

If we also require v odd, then we'll just move the clz and shift a bit,
with no really clear benefits. 

Pro: Avoids initial clz for uses where both inputs are known odd
apriori. May be slightly more convenient for the C gcd_11 with masking
tricks where initial clz and theone in the loop are a bit different.

Cons: Moves the responsibility an initial clz out of the supposedly well
optimized asm gcd_11 and out to C code. Call chain down from mpz_gcd
with small inputs is particularly important.

So I see no strong reason to do it one way or the other. Maybe start
with requiring both u and v odd, and relax requirements later in case we
find any more tangible benefit from doing that?

Regards, 
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.