General mpn_gcd_basecase

Thu Aug 29 14:12:04 UTC 2019

I switched focus to a general mpn_gcd_basecase, using the table-based
approach Niels and me discussed a week ot two ago here.

I decided to still keep away from data-dependent branches, which may not
be the best solution as the operands grow.

Ignoring unlikely cases, we have this loop:

 while (n > 2)
    {
      CMP_SWAP (up, vp, n);

      a = up[0];
      b = vp[0];
      m = tab[((a & mask) << (4 - 1)) + ((b & mask) >> 1)];

      cy = mpn_addmul_1 (up, vp, n, m);
      a = up[0];
      count_trailing_zeros (cnt, a);
      mpn_rshift (up, up, n, cnt);
      up[n - 1] |= cy << (GMP_LIMB_BITS - cnt);

      if (mpn_cmp (up, vp, n) == 0)
         we're done

      n -= (up[n - 1] | vp[n - 1]) == 0;
    }

This isn't a great loop.  It passes through the operands twice (addmul_1
and rshift, the cmp typically exits immediately). It also uses the
only-positive-elements variant of the table, which means much poorer
progress.

How could it be improved?  If we allow data-dependent jumps, we could
use a more optimal table.  We'd then conditionally call addmul_1 or
submul_1.  The latter would sometimes yield negative results.  We could
avoid those by having an "rsbmul_1".  We could use a sloppy way of
choosing between submul_1 and rsbmul_1, and fixup poor choices when we
occasionally got a negative result.

Data-dependent jumps will behaps be OK for n more than, say, n=5.  I'd
like a single run through the data with 3-epsilon bits of progress per
iteration, and no branches for typical iteration.  We might be able to
get rid of rshift by scaling the multiplier m, i.e., if effect let
operands' low bit be at a distance from the lowest limb bit.

We don't need to limit ourselves to using *mul_1 loops.  We could use
something slightly more advanced.

These algorithms will be useful for CPUs with fast *mul_1 loops.  Amd
Ryzen and Intel *lake all have such loops which run at less than 2
cycles/limb.

Ideas are welcome!  Great, working code is even more welcome!

Here is the full code for the basic algorithm outlined above:

void
mpn_gcd_basecase (mp_limb_t *rp, mp_limb_t *up, mp_limb_t *vp, mp_size_t n)
{
  static unsigned char tab[256] = {
 31, 21, 19,  9,  7, 29, 27, 17, 15,  5,  3, 25, 23, 13, 11,  1,
 29, 31, 25, 27, 21, 23, 17, 19, 13, 15,  9, 11,  5,  7,  1,  3,
 27,  9, 31, 13,  3, 17,  7, 21, 11, 25, 15, 29, 19,  1, 23,  5,
 25, 19,  5, 31, 17, 11, 29, 23,  9,  3, 21, 15,  1, 27, 13,  7,
 23, 29, 11, 17, 31,  5, 19, 25,  7, 13, 27,  1, 15, 21,  3,  9,
 21,  7, 17,  3, 13, 31,  9, 27,  5, 23,  1, 19, 29, 15, 25, 11,
 19, 17, 23, 21, 27, 25, 31, 29,  3,  1,  7,  5, 11,  9, 15, 13,
 17, 27, 29,  7,  9, 19, 21, 31,  1, 11, 13, 23, 25,  3,  5, 15,
 15,  5,  3, 25, 23, 13, 11,  1, 31, 21, 19,  9,  7, 29, 27, 17,
 13, 15,  9, 11,  5,  7,  1,  3, 29, 31, 25, 27, 21, 23, 17, 19,
 11, 25, 15, 29, 19,  1, 23,  5, 27,  9, 31, 13,  3, 17,  7, 21,
  9,  3, 21, 15,  1, 27, 13,  7, 25, 19,  5, 31, 17, 11, 29, 23,
  7, 13, 27,  1, 15, 21,  3,  9, 23, 29, 11, 17, 31,  5, 19, 25,
  5, 23,  1, 19, 29, 15, 25, 11, 21,  7, 17,  3, 13, 31,  9, 27,
  3,  1,  7,  5, 11,  9, 15, 13, 19, 17, 23, 21, 27, 25, 31, 29,
  1, 11, 13, 23, 25,  3,  5, 15, 17, 27, 29,  7,  9, 19, 21, 31
  };

  const mp_limb_t mask = (2 << 4) - 2;

  mpn_zero (rp, n);

 while (n > 2)
    {
      mp_limb_t a, b, m, cy;
      int cnt;

      CMP_SWAP (up, vp, n);

      a = up[0];
      b = vp[0];
      m = tab[((a & mask) << (4 - 1)) + ((b & mask) >> 1)];

      cy = mpn_addmul_1 (up, vp, n, m);
      a = up[0];
      if (UNLIKELY (a == 0))
	{
	  int all_zeros = 1;
	  mp_size_t i;
	  for (i = 1; i < n; i++)
	    {
	      if (up[i] != 0)
		{
		  all_zeros = 0;
		  break;
		}
	    }

	  if (all_zeros)
	    {
	      ASSERT_ALWAYS (cy != 0);

	      count_trailing_zeros (cnt, cy);
	      up[0] = cy >> cnt;
	      mpn_zero (up + 1, n - 1);
	    }
	  else
	    {
	      mpn_copyi (up, up + i, n - i);
	      up[n - i] = cy;
	      i -= cy != 0;
	      mpn_zero (up + n - i, i);

	      ASSERT_ALWAYS (up[0] != 0);

	      if ((up[0] & 1) == 0)
		{
		  int cnt;
		  count_trailing_zeros (cnt, up[0]);
		  mpn_rshift (up, up, n - i, cnt);
		}
	    }
	}
      else
	{
	  count_trailing_zeros (cnt, a);
	  mpn_rshift (up, up, n, cnt);
	  ASSERT_ALWAYS (cnt);
	  up[n - 1] |= cy << (GMP_LIMB_BITS - cnt);
	}

      if (mpn_cmp (up, vp, n) == 0)
	{
	  mpn_copyi (rp, up, n);
	  return;
	}

      n -= (up[n - 1] | vp[n - 1]) == 0;
    }

 mp_double_limb_t g = mpn_gcd_22 (up[1], up[0], vp[1], vp[0]);
 rp[0] = g.d0;
 rp[1] = g.d1;
}

-- 
Torbjörn
Please encrypt, key id 0xC8601622