hgcd1/2

Wed Sep 11 14:36:51 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  Sounds good. Give it a few days, and delete if it still looks slow
  everywhere.

I cooked a modern alternative:

static mp_double_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  mp_double_limb_t res;
  int ncnt, dcnt, cnt;
  mp_limb_t q;
  mp_limb_t mask;

  ASSERT (n0 >= d0);

  count_leading_zeros (ncnt, n0);
  count_leading_zeros (dcnt, d0);
  cnt = dcnt - ncnt;

  d0 <<= cnt;

  q = -(mp_limb_t) (n0 >= d0);
  n0 -= d0 & q;
  d0 >>= 1;
  q = -q;

  while (--cnt >= 0)
    {
      mask = -(mp_limb_t) (n0 >= d0);
      n0 -= d0 & mask;
      d0 >>= 1;
      q = (q << 1) - mask;
    }

  res.d0 = n0;
  res.d1 = q;
  return res;
}

  That should use the same choices for div1 and div2. Is it important to
  inline some or all of the div1/div2 code? If we move them out to
  separate .c files, it gets easier to share between hgcd2 and
  hgcd2_jacobi, and easier to experiment with assembly div1. But it gets a
  bit more complex to set things up for tuneup.

Not considering tuneup hardness, I suppose inlining makes some sense, at
least of _METHOD 1 which often ends up as a single instruction.

I've never been a big fan of inlining.  Small operands GCD in GMP is
special in many ways, as it is hundreds of times slower than
multiplication.  We really need to try hard to improve it.

-- 
Torbjörn
Please encrypt, key id 0xC8601622