# gcd_11

Niels Möller nisse at lysator.liu.se
Fri Aug 16 22:02:16 UTC 2019

```After discussing several interesting gcd algorithms with Torbjörn, I've
made this observation that I think is kind-of interesting.

Lehmer code depends on the fact that if M is a 2x2 integer matrix with
determinant ±1, and

(a';b') = M (a;b),

then gcd (a,b) = gcd (a', b'). But for binary gcd, we can relax this: If
the matrix determinant is a power of two (in absolute value), then
gcd(a,b) and gcd(a',b') can differ only by a power of two. And in the
binary algorithm, we discard powers of two as we go, so they're no
problem.

In addition, many small matrices have power-of-two determinant, e.g.,
det(1,1;1,-3) = -4.

So we can choose a', b' quite freely among various linear combinations,
e.g., a+b, a-b, a+3b; we can chose any pair as long as the corresponding
determinant is a power of two. One way to arrange it that I think may be
simple and efficient is to chose constants m, k so that

a' = a + kb = 0 (mod 8)
b' = a + mb = 4 (mod 8)

This can be done by first choosing a linear combination that is
divisible by 4, and set t = (a + b) / 4 or t = (a + 3b) / 4 (that's
pretty standard).

But next, form a', b' as t and t - b, one of which is even and the
other odd. Corresponding matrices seem to satisfy the requirement: since
det (1,k;1,m) = m - k, we can use any distinct pair of numbers in {-3,
-1, 1, 3 } except the pair ±3.

Algorithm (which is made a bit more complicated by the implicit lsb
trick):

mp_limb_t
mpn_gcd_11 (mp_limb_t u, mp_limb_t v)
{
ASSERT (u & v & 1);

/* In this loop, we represent the odd numbers ulimb and vlimb
without the redundant least significant one bit. This reduction
in size by one bit avoids overflow. */
u >>= 1;
v >>= 1;

for (;;)
{
mp_limb_t n0, n1;
int c;
/* In this iteration, we set

u' = u + k v = 0 (mod 8), v' = u + m v = 4 (mod 8),

with k, m chosen from the set { -3, -1, 1, 3 }. It's arranged
bitwise; first set t = (a + {1,3} b) / 4, then use t and t -
b for the next iteration, keeping track of which one is odd so
that we only need one count_trailing_zeros.
*/
mp_limb_t t = u + v + 1;  /* (u + v) / 2, explicit lsb */
mp_limb_t m0 = - (t & 1);
t = (t >> 1) + (m0 & (v + 1)); /* ((u + v)/ 2 + m0 v) / 2 */
mp_limb_t m1 = - (t & 1);
u = (t >> 1) - (m1 & v); /* Make even, then shift out lsb 0 */
if (u == 0)
return (v << 1) + 1;
count_trailing_zeros (c, u);
v = (t >> 1) + (~m1 & ~v); /* Make odd, shift out lsb 1 */

/* Take absolute values */
u = ((u^n0) - n0) >> (c+1);
/* Negation only happens when ~m1, no -n1 since that's canceled
by the implicit bit. */
v ^= n1;
}
}

Seems to work, but not particularly fast :-( It's unclear to me what
progress actually is made in one iteration. There's also a long
dependency chain, it could probably be shortened a bit by computing the
m1 mask directly from the inputs, so it could be scheduled in parallel
with theoperations involving m0.

I also wonder if it can be extended to larger matrices; if one can find
matrices with k bit elements that lets us clear 2k bits most of the
time? Power-of-two determinant is a constraint, but there are still lots
of matrices.

The idea is a bit similar to the binary recursive gcd by Stehlé and
Zimmermann, but here we're aiming for fairly small numbers with matrix
selected by table lookup, with no need for a divide-and-conquer
algorithm.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

```