hgcd1/2

Sat Sep 7 14:30:40 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> I'm a bit surprised that the METHOD 3 div1 didn't yield more speedup.
> The hgcd2 code is almost content with a 25 cycle / operation, which is
> really, really odd to me.

Keep in mind that the optimizations to div1 so far only help the second,
cheaper, half of hgcd2. The first half, working with double precision,
still uses the div2 function, which is full of branches.

> The quotient is < 8 in over 80% of the time.  It is < 4 almost 70% of
> the time.  Perhaps we should aim for handling just < 4 to save some
> time.

That will likely be faster on some machines. Depending on how expensive
division is. And fairly easy.

> Then we could compare (a mod b)/2^s with (b - a mod b)/2^t where s and t
> are the number of trailing bits in respective expression.

Taking out powers of two makes things complicated. Taking out a factor
of two from one of the numbers corresponds to a matrix (1,0;0,2) with
determinant 2. So the inverse is not an integer matrix, but an integer
matrix + a shift count.

Switching the sign, i.e., choosing the smallest of a mod b and -a mod b,
should be a bit easier, but we'll need to handle that the matrix
determinant will sometimes be -1 rather than +1.

> We could then perhaps find some middle ground which doesn't try to find
> the absolutely minimal remainder, but one which avoids negative numbers
> which hurts...

hgcd2 kind of does that already. When it goes to single limb precision,
it discards the low half limb, and it will not always produce the
minimal remainders.

It would be interesting to try some left-to-right binary hgcd. Logic
should be something like

  count_leading_zeros(a_bits, a);
  count_leading_zeros(b_bits, b);

  if (a_bits == b_bits)
    (a, b) <-- (min(a,b), |a-b|)
  else if a_bits > b_bits
    a <-- |a - b * 2^{a_bits - b_bits}|
  else 
    b <-- |b - a * 2^{b_bits - a_bits}|

The corresponding matrix updates are multiplication free, just shift and
add. To make it competitive, it needs to be branch free. But it will
make less progress than euclid. One could avoid some of the absolute
values by using

  else if a_bits > b_bits
    a <-- a - b * 2^{a_bits - b_bits - 1}

but with less progress per iteration.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.