gcd_11

Sat Aug 17 08:59:00 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  After discussing several interesting gcd algorithms with Torbjörn, I've
  made this observation that I think is kind-of interesting.

Let me show the most algorithm I came up with in this discussion will
Niels!

I have long thought that table-driven GCD should be feasible.  Every
time I have failed to construct an algorithm with small enough table.

Now, by starting at the simple observation that, for the odd a and b in
Stein's "binary algorithm", either a-b or a+b is divisible by at least 4
(while the other is always divisible by 2 but not 4), I made some
progress.

How about choosing not just between a-b and a+b, but between a+mb for
small m \in Z?  For every 2nd m we have 4 | a+mb, for every 4th m we 8 |
a+mb have, etc.  That means that some small m will allow us to divide
out many zero bit.

Fortunately, such divisibility depends on just the low few bits of a and
b, i.e., we can pre-compute suitable m and table them, and then retrieve
a great m by indexing with the few low bits of each a and b we come
across!

Here is the main algorithm, with a given table and k, the latter being
log_2 of the table dimension:

    static mp_limb_t
    gcd_bin_tabkm (mp_limb_t x, mp_limb_t y, signed char *tab, size_t k)
    {
      int cnt;

      // The main algorithm cannot handle full words, use plain Stein in order to
      // chop away a few bits.
      while ((x | y) >> (62 - k))
        {
          if (x > y)
            {
              x -= y;
              count_trailing_zeros (cnt, x);
              x >>= cnt;
            }
          else if (y > x)
            {
              y -= x;
              count_trailing_zeros (cnt, y);
              y >>= cnt;
            }
          else
            return x;
        }

      mp_limb_signed_t a = x;
      mp_limb_signed_t b = y;
      size_t mask = (2 << k) - 2;

      // Main loop.
      while (1)
        {
          mp_limb_signed_t m;
          if (a > b)
            {
              // a = a + bm
              m = tab[((a & mask) << (k - 1)) + ((b & mask) >> 1)];

              a = a + b * m;
              count_trailing_zeros (cnt, a);
              a >>= cnt;
              if (a < 0) a = -a;
              if (a == 0)
                return b;
            }
          else
            {
              // b = b + am
              m = tab[((b & mask) << (k - 1)) + ((a & mask) >> 1)];

              b = b + a * m;
              count_trailing_zeros (cnt, b);
              b >>= cnt;
              if (b < 0) b = -b;
              if (b == 0)
                return a;
            }
        }
    }

Here are some variants which pass gradually large tables:

    static mp_limb_t
    gcd_bin_tab4 (mp_limb_t a, mp_limb_t b, size_t *np)
    {
      static signed char tab[4] = {
        -1,  1,
         1, -1,
      };
      return gcd_bin_tabkm (a, b, np, tab, 1);
    }

    static mp_limb_t
    gcd_bin_tab16 (mp_limb_t a, mp_limb_t b, size_t *np)
    {
      static signed char tab[16] = {
    #if ! defined (TAB_VARIANT) || TAB_VARIANT == 4 || TAB_VARIANT > 4
     -1, -3, -5, -7,
     -3, -1, -7, -5,
     -5, -7, -1, -3,
     -7, -5, -3, -1,
    #endif
      };
      return gcd_bin_tabkm (a, b, np, tab, 2);
    }

    static mp_limb_t
    gcd_bin_tab64 (mp_limb_t a, mp_limb_t b, size_t *np)
    {
    static signed char tab[64] = {
    #if ! defined (TAB_VARIANT) || TAB_VARIANT == 7 || TAB_VARIANT > 8
     -1,-11,-13, -7, -9, -3, -5,  1,
     -3, -1, -7, -5,-11, -9,  1,-13,
     -5, -7, -1, -3,-13,  1, -9,-11,
     -7,-13,-11, -1,  1, -5, -3, -9,
     -9, -3, -5,  1, -1,-11,-13, -7,
    -11, -9,  1,-13, -3, -1, -7, -5,
    -13,  1, -9,-11, -5, -7, -1, -3,
      1, -5, -3, -9, -7,-13,-11, -1,
    #endif
    };
      return gcd_bin_tabkm (a, b, np, tab, 3);
    }

    static mp_limb_t
    gcd_bin_tab256 (mp_limb_t a, mp_limb_t b, size_t *np)
    {
      static signed char tab[256] = {
    #if ! defined (TAB_VARIANT) || TAB_VARIANT == 14 || TAB_VARIANT > 16
     -1,-11,-13,-23,-25, -3, -5,-15,-17,-27,  3, -7, -9,-19,-21,  1,
     -3, -1, -7, -5,-11, -9,-15,-13,-19,-17,-23,-21,-27,-25,  1,  3,
     -5,-23, -1,-19,  3,-15,-25,-11,-21, -7,-17, -3,-13,  1, -9,-27,
     -7,-13,-27, -1,-15,-21, -3, -9,-23,  3,-11,-17,  1, -5,-19,-25,
     -9, -3,-21,-15, -1,-27,-13, -7,-25,-19, -5,  1,-17,-11,  3,-23,
    -11,-25,-15,  3,-19, -1,-23, -5,-27, -9,  1,-13, -3,-17, -7,-21,
    -13,-15, -9,-11, -5, -7, -1, -3,  3,  1,-25,-27,-21,-23,-17,-19,
    -15, -5, -3,-25,-23,-13,-11, -1,  1,-21,-19, -9, -7,  3,-27,-17,
    -17,-27,  3, -7, -9,-19,-21,  1, -1,-11,-13,-23,-25, -3, -5,-15,
    -19,-17,-23,-21,-27,-25,  1,  3, -3, -1, -7, -5,-11, -9,-15,-13,
    -21, -7,-17, -3,-13,  1, -9,-27, -5,-23, -1,-19,  3,-15,-25,-11,
    -23,  3,-11,-17,  1, -5,-19,-25, -7,-13,-27, -1,-15,-21, -3, -9,
    -25,-19, -5,  1,-17,-11,  3,-23, -9, -3,-21,-15, -1,-27,-13, -7,
    -27, -9,  1,-13, -3,-17, -7,-21,-11,-25,-15,  3,-19, -1,-23, -5,
      3,  1,-25,-27,-21,-23,-17,-19,-13,-15, -9,-11, -5, -7, -1, -3,
      1,-21,-19, -9, -7,  3,-27,-17,-15, -5, -3,-25,-23,-13,-11, -1,
    #endif
       };
      return gcd_bin_tabkm (a, b, np, tab, 4);
    }

What about progress of the algorithm?

Algorithm                         Iterations  Iter/Bit  Bit/Iter  Max iter
plain binary                   :   7587840328   0.697    1.435       60
binary+-                       :   6611310442   0.607    1.647       59
binary tab4                    :   6657355138   0.611    1.636       59
binary tab16                   :   4803839583   0.441    2.267       42
binary tab64                   :   4364358404   0.401    2.495       43
binary tab256                  :   4148274655   0.381    2.625       44
binary tab1024                 :   3951334269   0.363    2.756       37
binary tab4096                 :   3884273290   0.357    2.803       38

With the largest table above "table256", we get about 2.6
bits/iteration, compared to 1.4 for the plain binary algorithm.

So is this practical?  We haven't made a good implementation yet.  Such
an implementation would need to eliminate all data-dependent branches.

-- 
Torbjörn
Please encrypt, key id 0xC8601622