GCD project status?

Tue Sep 17 21:02:01 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> Ideally, one would compile hgcd2.c in all possible variants (presumable
> through hand-crafted hgcd2-1-1.c, hgcd2-2-1.c, etc., and then call the
> selected hgcd2's entry point through a function pointer for further
> measuring.

Hmm. So the hgcd2-x-y.c would be there als for speed, and hgcd2.c would
have something like

#if TUNE_PROGRAM_BUILD
int mpn_hgcd2(...)
{
  return (*hgcd2_method_pointer)(...);
}

> I'm pretty confused about the algorithms used for gcdext.  Is it at all
> helped by our hgcd2 improvements?

If you look at mpn_gcdext_lehmer_n, that's a loop around hgcd2, applying
each matrix to the numbers being reduced using
mpn_matrix22_mul1_inverse_vector, and to the cofactors using
mpn_hgcd_mul_matrix1_vector.

At the end, it calls mpn_gcdext_1. mpn_gcdext_22 is missing.

> What is suboptimal about gcdext for large operands?

Subquadratic gcd is a loop calling hgcd on the high part of the numbers
(iirc, n/3 size), then applies the matrix also to the low part, and
repeat until numbers are in lehmer range.

Current subquadratic gcdext does the same, but also builds up cofactors
along the way. So numbers getting smaller, and cofactors larger. For
one, that makes the optimal splitting for the hgcd call complicated, one
should take into account not only the size of the numbers, but also the
size of the cofactors. That's done in an adhoc way, different splitting
for the very first iteration, then a fixed size ratio. 

More fundamentally, the multiplications used to build up the cofactors
get more and more unbalanced for each iteration. That doesn't seem
healthy for a divide-and-conquer algorithm.

I suspect a better way is to do it like this: Given A, B of size n,
split off high part, say size n/2 or so. Call hgcd, so we get reduced
numbers a, b, with

 (A;B) = M (a;b)

where M is if size n/4 and a,b of size 3n/4.

Recursively compute gcdext(a,b), to get

  g = s a + t b

Then construct one or both cofactors for 

  g = S A + T B

based on s, t and the matrix M. Then we'll reduce numbers in size as we
recurse down, and build up the cofactors as we return back, without
getting increasingly unbalanced multiply instructions.

But I haven't tried it out.

>   * Try out left-to-right binary hgcd2.
>
> I had not realised you meant left-to-right.  I thought you talked about
> right-to-left.  Interesting!

There are a couple of different cases, I'm afraid. Say we have a of
a_size bits and b of b_size bits.

If a_size == b_size, set

  {a, b} <-- {|a-b|, min(a,b)}

Like a binary step, but we don't divide out powers of two, and we
eliminate at least one bit at the high end.

When a_size > b_size, then to avoid underflow I'm afraid we need 

  a <-- a - 2^{a_size - b_size - 1} b

with progress of 0.5 bits or so. Or we could go half way-towards
computing a quotient by using

  a <-- a - 2^{a_size - b_size} b

in case that is non-negative, which would give us better progress. And
similarly for a_size < b_size.

So a fair amount of logic that needs to be branch free. Updates of u0
and u1 are cheap, but they make any needed conditional swap more
costly.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.