gcd_11

Sat Aug 17 10:25:05 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

>     static mp_limb_t
>     gcd_bin_tab16 (mp_limb_t a, mp_limb_t b, size_t *np)
>     {
>       static signed char tab[16] = {
>     #if ! defined (TAB_VARIANT) || TAB_VARIANT == 4 || TAB_VARIANT > 4
>      -1, -3, -5, -7,
>      -3, -1, -7, -5,
>      -5, -7, -1, -3,
>      -7, -5, -3, -1,
>     #endif
>       };
>       return gcd_bin_tabkm (a, b, np, tab, 2);
>     }

And one has some freedom to choose representativs of these numbers mod
2^{k+1}. E.g., one might replace -7 by +1 and/or -5 by +3.

> So is this practical?  We haven't made a good implementation yet.  Such
> an implementation would need to eliminate all data-dependent branches.

It could be arrange as first doing
  t = u - v;
  v <-- min (u,v)
  u <-- |t|

with the same masking tricks as in the current gcd_11. Then, to
eliminate some additional bits of u, do a table lookup (based on u, v
*after* the above operations) for an even value m, and set

  u <-- u + m v

Multiplication with additiona and masks is possible, but I'm afraid the
needed shifts to get v added in at the right position may be slow.

if u == 0, terminate, otherwise set c = count_trailing_zeros(u).

If we use any negative candidates for m, do conditional negation to get

  u <-- |u|

and finally shift u by c bits.

Compared to gcd_11, we get a few more bits of progress. The
additonal operations are the table lookup (could be shifts of a magic
constant), the multiplication, and the absolute value.

-----

On a somewhat different track, I've reread the paper on k-ary gcd, and
lets focus on 16-ary, eliminating a least 4 bits at a time.

Given u, v odd, u > v, one can compute the quotient q = u/v mod 16; this
is the same multiplier that you tabulate. Let's use representatitives
±1, ±3, ± 5, ±7.

One can then set u <-- (u - q v) / 16 (which may be odd or even). But
that won't always give 4 bits of progress; if q v > u we get some growth
at the high end, and this happens when q is largish and u and v are
close in size. When q is small, ±1 or ±3, progress should aways be good
is good.

The k-ary trick comes in when q is largish. When q = 5, instead of

  u <-- (u - 5v) / 16

we could try 

  u <-- (3u + v) / 16

Unfortunately, that may introduce a spurious factor of 3, since gcd (3u
+ v, v) = gcd (3u, v), and probably not worth it, so lets stick to u -
5v in this case, and consider larger q.

If q = 7, then instead of u <-- u - 7v, either of the linear combinations

  u <-- (3u - 5v) / 16 
  u <-- (5u - 3v) / 16 

gives good progress for u, but with risk for spurious factors. However,
if we use *both* combinations,

  u <-- (5u - 3v) / 16 
  v <-- (3u - 5v) / 16 

then we can note that the determinant is -9 + 25 = -16, is a power of
two! So we get avoid the spurious factor problem, at the cost of some
growth of min (u,v) when q = ±7; the -7 case is a bit worse, with

  u <-- (5u + 3v) / 16 
  v <-- (3u + 5v) / 16 

Here v is still the smaller number, and worst case growth is from close
to zero to close to 3u/16. Maybe we get good enough progress on average
so that it's worth it? 

We could also adapt, and use the u <-- u - 7 v when v is small compared
to u, and the k-ary linear combinations when u and v are close in size.
That should give pretty good progress in every iteration, but perhaps
it gets to complicated to have a chance of running fast.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.