General mpn_gcd_basecase

Sat Aug 31 17:07:26 UTC 2019

  > Sounds like the opposite of binary euclid.

  Sounds more interesting than binary euclid, to me. Because looking at the
  lowest bits and just detecting the bit size seems easier than extracting
  the highest bits each loop.

I suppose I don't fully understand how to make a table-based binary
euclid with good progress.  A binary Euclid approaches 3 bits/iteration,
which is what we get without the latest scaling trick for the binary
table gcd.

I'd love to see a fast binary euclid.

I have now added statistics collection to my latest code.  It turns out
that for equal size operands, the addmul_1 case is executed some 95% of
the time.  The scaling trick only applies to the submul_1 cases.  The
result is 2 bits/iteration (with 256-byte tables).  :-(

In order to trigger the submul arm, I tested with operands of quite
different sizes.  Now I get close to 6 bits per iteration, which of
course is awesome.

We need to think more abour this.  With my present algorithm I seem to
get almost 3 times better progress from gcd(2^64*U+1,V) than from
gcd(U,V) when U and V are close in size.  The former triggers the
scaling submul_1 code while the latter gets stuck in the addmul_1 code.

For similar-size operands, the scaling submul_1 code gets used just when
ether operand by chance happen the geta large size reduction.

-- 
Torbjörn
Please encrypt, key id 0xC8601622