General mpn_gcd_basecase

Fri Aug 30 11:56:14 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  I doubt sign multiply would be that useful. An instruction to do signed
  64b x unsigned 64b --> signed 128b, would be nice, but I don't know any
  arch having that.

Unsigned ans signed widening multiplication can be done in terms of ther
other.  There is code in longlong.h for doing that:

    /* If we lack umul_ppmm but have smul_ppmm, define umul_ppmm in terms of
       smul_ppmm.  */
    #if !defined (umul_ppmm) && defined (smul_ppmm)
    #define umul_ppmm(w1, w0, u, v)                                         \
      do {                                                                  \
        UWtype __w1;                                                        \
        UWtype __xm0 = (u), __xm1 = (v);                                    \
        smul_ppmm (__w1, w0, __xm0, __xm1);                                 \
        (w1) = __w1 + (-(__xm0 >> (W_TYPE_SIZE - 1)) & __xm1)               \
                    + (-(__xm1 >> (W_TYPE_SIZE - 1)) & __xm0);              \
      } while (0)
    #endif

I think signed x unsigned is about half that hard.  :-)

  Just suggesting that we'd may get slightly less latency with

        a = up[0] - m*vp[0];
        count_trailing_zeros (cnt, a);
        cy = mpn_addmul_1 (up, vp, n, m);

  Maybe not so useful on it's own, but with assembly, there are further
  opportunities. 

Like doing rshift on-the-fly.

  > This addmul_1-only, positive-only table is a strange gcd animal.  It
  > will never yield U = 0.  Individual limbs are not unlikely to become
  > zero.

  Ooops. Then one definitely need a different stop condition than comparing
  something to zero.

I tried the stop condition U = 0 first, and it seemed to run really
slowly.  :-)

  If your willing to have both count_leading_zeros and
  count_trailing_zeros in the loop, maybe it's simpler and more efficient
  with a table-based binary euclid?

    CND_SWAP
    count_leading_zeros(cnt, up[n-1] | vp[n-1])
    q = tab[...high bits...]

    if (q odd)
      u <-- u - q v
    else
      u <-- (q+1) u - v

    count_trailing_zeros(cnt, up[0]);  /* At least one */
    u >>= cnt;

I don't follow the u assignment in the else clause.  And even*odd-odd
would seem to become odd.

Perhaps you meant u <-- u - (v - q)v?

But a table-based Euclid would be interesting!

On many machines, count_*ing_zeros are cheap.  On AMD Ryzen, _leading_
has the same latency and throughput as plain addition (latency = 1,
throughput 4/cycle), while _trailing_ has twice the latency and half the
throughput.  On Intel the latency is worse, 3 cycles for both trailing
and leading with a throughput of 1/cycle, but that's still not bad.

One might unroll Euclid 2x and only recompute _leading_ for the affected
operand.

Would you give your table-based Euclid a try?  Please use
~tege/gcd/test-gcd_33.c for testing, replacing the _33 call with an _n
call.  (I'll might make a soecial test_gcd_n.c at some point.)

  It seems like an mpn_rsbmul_1 would be a useful primitive for several
  variants.

Yep.  I never even considered that in GMP until now.

  Sounds like the opposite of binary euclid.

Perhaps!

-- 
Torbjörn
Please encrypt, key id 0xC8601622