Torbjörn Granlund tg at gmplib.org
Sat Sep 7 11:17:05 UTC 2019

I'm a bit surprised that the METHOD 3 div1 didn't yield more speedup.
The hgcd2 code is almost content with a 25 cycle / operation, which is
really, really odd to me.

The quotient is < 8 in over 80% of the time.  It is < 4 almost 70% of
the time.  Perhaps we should aim for handling just < 4 to save some

Then, on the other hand, we could make a METHOD 3 div1 in assembly which
should be somewhat quicker.

So if out efforts to improve div1 doesn't make wonders, perhaps we
should look into a better variant of Euclid's algorithm for developing
the quotient chain?

To start with, we could compare a mod b with b - a mod b and choose
whichever is smaller.

Then we could compare (a mod b)/2^s with (b - a mod b)/2^t where s and t
are the number of trailing bits in respective expression.

Reducing the number of overall hgcd2 iterations should really make a

But perhaps this would make the 2x2 matrix computation harder, and
require signed numbers, which would in turn complicate things too much?

We could then perhaps find some middle ground which doesn't try to find
the absolutely minimal remainder, but one which avoids negative numbers
which hurts...

(I haven't studied hgcd2 well enough to answer these questions myself,

Torbjörn Please encrypt, key id 0xC8601622

More information about the gmp-devel mailing list