General mpn_gcd_basecase

Fri Aug 30 09:01:11 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  If we allow data dependent branches, would it be unreasonable to go all
  the way and make the table a jump table, something like

    switch (((a & mask) << (4 - 1)) + ((b & mask) >> 1)) { ...}

  ?

  If we can afford a fair amount of different cases, I find the variant
  choosing between alternatives  a±b, a±3b, (a, b) <-- (3 a - 5b, 3a +
  5b) etc neat. Shifting at least 4 bits per iteration. 

Ah, so you're thinking of not just a+b*tab_b[...] but
a*tab_a[...]_b*tab_b[...]?  I understand the latter much less, and don't
know how to deal with the introduced spurious factors which seem
inevitable.

  It's tempting to use two's complement to deal with negative values.

I have that on my mental list of ideas.  How can we do a bignum*smallnum
when either or both are negative without sign extending smallnum?

Handling even negative values for the bignum here makes sense.

  I see two complications:

  1. We need an efficient way to find out which of |u| and |v| is
     smallest, for CMP_SWAP. But we can likely get away with something a
     bit sloppy, and accept arbitrary results in the case that u and v are
     "close", e.g., same bit size.

We do that now.  We just compare the top limbs.  (It works well when
there are many bits in the top limbs, but as the algorithm progresses,
we will typically have very few bits here, making the comparison quite
poor.  The same top-limb bit shortage is a problem also for my next
attempt, see below.)

  2. For the variant without branches, we'd need an mpn_addmul_1s working
     with two's complement. It could be implemented as mpn_addmul_1 +
     mpn_cnd_sub, but not sure if there's some easy way to implement with
     a single pass, somehow sign extending on the fly.

This might be easy or it might be hard.  We have signed 64b x 64b ->
128b on some machines (including x86) which might be useful.

  >       cy = mpn_addmul_1 (up, vp, n, m);
  >       a = up[0];
  >       count_trailing_zeros (cnt, a);

  One could move the count_trailing_zeros before mpn_addmul_1 by
  duplicating the mul, a = up[0] + m * vp[0]. Would make more sense if we
  also have an addmul + shift loop.

I don't follow your reasoning here.

Are yu suggesting tat we, at the beginning of addmul_1, insert a
count_trailing_zeros and then apply that in the loop.  That would be
doable. and OoO execution could help.  (It would have to handle up[0] +
m * vp[0] = 0, but not efficiently.)

The gcd_basecase loops I am playing with are simple enough that we may
write them in assembly, with any special addmul_1/submul_1 inlined.
Such special addmul_1/submul_1 could balk at unlikely arithmetic
outcomes, and instead focus on making the common cases fast.

  >       mpn_rshift (up, up, n, cnt);
  >       up[n - 1] |= cy << (GMP_LIMB_BITS - cnt);
  >
  >       if (mpn_cmp (up, vp, n) == 0)
  >          we're done

  The large gcd exit case belongs inside the if (UNLIKELY (a == 0)) branch
  needed for count_trailing_zeros.

It does?  I don't think that would work.

This addmul_1-only, positive-only table is a strange gcd animal.  It
will never yield U = 0.  Individual limbs are not unlikely to become
zero.

I dabbled in a variant which calls addmul_1 or submul, depending on of U
and V are close, or if U is much greater than V.  The first variant
looks like this (with some unlikely corner cases removed for clarity):

 while (n > 2)
    {
      CMP_SWAP (up, vp, n);

      a = up[0];
      b = vp[0];
      m = tabp[((a & mask) << (4 - 1)) + ((b & mask) >> 1)];

      if ((up[n - 1] >> 5) > vp[n - 1])
	cy = mpn_submul_1 (up, vp, n, 32 - m);
      else
	cy = mpn_addmul_1 (up, vp, n, m);

      a = up[0];
      count_trailing_zeros (cnt, a);
      mpn_rshift (up, up, n, cnt);
      up[n - 1] |= cy << (GMP_LIMB_BITS - cnt);

      if (mpn_cmp (up, vp, n) == 0)
	{
	  mpn_copyi (rp, up, n);
	  return;
	}

      n -= (up[n - 1] | vp[n - 1]) == 0;
    }

This variant presumably approaches 3 bits/iteration.  The reliance on
the top limbs for making decisions might be a real problem here (not for
correctness, the complete code works!).  Using the top two limbs for the
addmul_1/submul_1 decision (and possibly CMP_SWAP) could help, but comes
at a cost.  (Note that the branch-free CMP_SWAP is pretty silly now that
we spli up execution anyway.)

So why do I need both addmul_1 and submul_1 here at all?  Couldn't one
use only submul?  The problem is when U and V are close.  U - mV would
then often be negative.  We'd need mV - U, either by post-negation or by
making a good pre-determination of which of the two expression are
negative, and chose the right one.  (We would need a "rsbmul_1" loop
too.)

Lots of degrees of freedome.  :-/

My latest loop loks like this:

 while (n > 2)
    {
      mp_limb_t a, b, m, cy;
      int cnt, ucnt, vcnt;

      CMP_SWAP (up, vp, n);

      a = up[0];
      b = vp[0];
      m = tabp[((a & mask) << (4 - 1)) + ((b & mask) >> 1)];

      count_leading_zeros (ucnt, up[n - 1]);
      count_leading_zeros (vcnt, vp[n - 1]);

      cnt = vcnt - ucnt;
      if (cnt >= 6)
	{
	  m = (CNST_LIMB(1) << cnt - 1) - m;
	  cy = mpn_submul_1 (up, vp, n, m);
	}
      else if (vcnt > 5 && (up[n - 1]) > ((vp[n - 1] << 5) | (vp[n - 2] >> 59)))
	{
        /* This case is probably only marginally useful as most cases
           are covered by the case above. */
	  m = 32 - m;
	  cy = mpn_submul_1 (up, vp, n, m);
	}
      else
	{
	  cy = mpn_addmul_1 (up, vp, n, m);
	}

      a = up[0];
      ... the rest is identical to the code above ...
    }

Comments:

1. The count_leading_zeros will be fed with zero up[n-1] or vp[n-1] now
   and then.  It is not always well-defined for that case by longlong.h
   (see COUNT_LEADING_ZEROS_0 in longlong.h).  Furthermore, when that
   happens, even when we make sure the result becomes the expected 64,
   our decisions based on the top limbs only are poor in this situation.

2. The idea with m = (CNST_LIMB(1) << cnt - 1) - m is to chop of a bit
   also in the top end of U, not just get >= 5 bits of zeros in the low
   end.  It is a generalisation of the tables I played with before for
   gcd_11; any entry can be offset by a multiple of 2^5 = 32 without
   making less than 5 low bit become 0.  I actually believe this is a
   great idea, but I haven't checked if it is as effective as I think it
   is...

3. Yes, we have several data dependent random branches now.  This will
   not beat the previous _22 or _33 on their turf.  But for larger n it
   will not be too bad.

I suppose I should measure the speed of the present hgcd2 based code to
estimate a limit of a gcd_basecase's utility.  That bound is surely
going to decrease when we improve hgcd2.

-- 
Torbjörn
Please encrypt, key id 0xC8601622