General mpn_gcd_basecase

Niels Möller nisse at
Fri Aug 30 10:33:07 UTC 2019

tg at (Torbjörn Granlund) writes:

> Ah, so you're thinking of not just a+b*tab_b[...] but
> a*tab_a[...]_b*tab_b[...]?  I understand the latter much less, and don't
> know how to deal with the introduced spurious factors which seem
> inevitable.

If you do

  a <-- s a + t b, b unchanged

then you can get spurious factors from s. And in the case of the binary
algorithm, that's fine if s is a power of 2. (I think you mentioned that
earlier: Try to find k so that 2^k mod b is a few bits smaller than b,
and set a <-- 2^k a - q b, to increase euclid progress).

If we update both a and b with a matrix multiply, (a;b) <-- M (a;b), we
can get spurious factors from the determinant of M.

In an earlier mail I sketched a variant that update both a and b only in
a few of the cases, and then only using matrices where the determinant
happens to be a power of 2.

>   I see two complications:
>   1. We need an efficient way to find out which of |u| and |v| is
>      smallest, for CMP_SWAP. But we can likely get away with something a
>      bit sloppy, and accept arbitrary results in the case that u and v are
>      "close", e.g., same bit size.
> We do that now.  We just compare the top limbs.

Only assuming u and v unsigned, right? For two's complement, we need to
take absolute values into account in the comparison in some way.

>   2. For the variant without branches, we'd need an mpn_addmul_1s working
>      with two's complement. It could be implemented as mpn_addmul_1 +
>      mpn_cnd_sub, but not sure if there's some easy way to implement with
>      a single pass, somehow sign extending on the fly.
> This might be easy or it might be hard.  We have signed 64b x 64b ->
> 128b on some machines (including x86) which might be useful.

I doubt sign multiply would be that useful. An instruction to do signed
64b x unsigned 64b --> signed 128b, would be nice, but I don't know any
arch having that.

>   >       cy = mpn_addmul_1 (up, vp, n, m);
>   >       a = up[0];
>   >       count_trailing_zeros (cnt, a);
>   One could move the count_trailing_zeros before mpn_addmul_1 by
>   duplicating the mul, a = up[0] + m * vp[0]. Would make more sense if we
>   also have an addmul + shift loop.
> I don't follow your reasoning here.

Just suggesting that we'd may get slightly less latency with

      a = up[0] - m*vp[0];
      count_trailing_zeros (cnt, a);
      cy = mpn_addmul_1 (up, vp, n, m);

Maybe not so useful on it's own, but with assembly, there are further

> This addmul_1-only, positive-only table is a strange gcd animal.  It
> will never yield U = 0.  Individual limbs are not unlikely to become
> zero.

Ooops. Then one definitely need a different stop condition than comparing
something to zero.

> My latest loop loks like this:
>  while (n > 2)
>     {
>       mp_limb_t a, b, m, cy;
>       int cnt, ucnt, vcnt;
>       CMP_SWAP (up, vp, n);
>       a = up[0];
>       b = vp[0];
>       m = tabp[((a & mask) << (4 - 1)) + ((b & mask) >> 1)];
>       count_leading_zeros (ucnt, up[n - 1]);
>       count_leading_zeros (vcnt, vp[n - 1]);

If your willing to have both count_leading_zeros and
count_trailing_zeros in the loop, maybe it's simpler and more efficient
with a table-based binary euclid?

  count_leading_zeros(cnt, up[n-1] | vp[n-1])
  q = tab[...high bits...]

  if (q odd)
    u <-- u - q v
    u <-- (q+1) u - v

  count_trailing_zeros(cnt, up[0]);  /* At least one */
  u >>= cnt;

(plus handling of less likely special cases).

It seems like an mpn_rsbmul_1 would be a useful primitive for several

> 2. The idea with m = (CNST_LIMB(1) << cnt - 1) - m is to chop of a bit
>    also in the top end of U, not just get >= 5 bits of zeros in the low
>    end.  It is a generalisation of the tables I played with before for
>    gcd_11; any entry can be offset by a multiple of 2^5 = 32 without
>    making less than 5 low bit become 0.  I actually believe this is a
>    great idea, but I haven't checked if it is as effective as I think it
>    is...

Sounds like the opposite of binary euclid.


Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list