Revisiting gcd and gcdext for small operands
Niels Möller
nisse at lysator.liu.se
Fri Dec 1 15:00:31 UTC 2017
tg at gmplib.org (Torbjörn Granlund) writes:
> Nisse and other great common dividers!
>
> Please consider these measurements:
>
> ibr32$ ./tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
> mpn_mul mpn_gcd mpn_gcdext
> 1 #39.14 6.5512 17.1607
> 2 #50.77 12.4422 51.4993
> 3 #66.27 36.0216 57.9561
> 4 #87.17 40.7439 58.6398
> 5 #110.95 41.7779 57.3457
> 6 #136.56 42.4274 54.8383
> 7 #165.67 42.7125 51.9271
> 8 #205.14 39.7861 47.5393
> 9 #249.47 37.0144 42.8386
> 10 #289.77 36.0182 40.6772
> 12 #398.58 32.2387 34.2015
> 14 #516.54 28.4706 28.4355
> 16 #653.70 25.3546 25.7443
> 19 #887.40 21.4907 20.0516
> 22 #1196.18 17.9833 17.4368
> 26 #1617.71 15.1357 15.3959
> 31 #2196.36 12.3736 13.3466
> 37 #2963.45 9.1705 12.3230
> 44 #3970.82 9.4049 11.9397
> 52 #5309.94 7.8460 11.6024
>
> ibr64# tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
> mpn_mul mpn_gcd mpn_gcdext
> 1 #34.51 9.5965 43.4851
> 2 #41.46 24.7985 91.1526
> 3 #63.37 60.3494 89.9912
> 4 #80.39 69.5938 92.7650
> 5 #105.76 69.8544 88.4005
> 6 #134.70 68.6191 83.3540
> 7 #173.62 64.7691 74.9246
> 8 #208.92 62.9363 72.0671
> 9 #262.49 56.4922 63.4748
> 10 #314.04 52.3732 59.2200
> 12 #437.50 46.1083 50.0230
> 14 #578.39 41.1333 42.6137
> 16 #714.52 38.2636 38.8416
> 19 #1021.12 32.3526 30.0309
> 22 #1274.16 29.6524 28.3373
> 26 #1653.53 26.4316 25.1063
> 31 #2283.67 22.3448 20.5088
> 37 #3185.51 16.2611 14.3919
> 44 #4003.10 16.8047 15.6203
> 52 #5287.91 12.4089 16.1354
>
> Without detail knowledge of the current implementation, these numbers
> suggest to me that we could speed small operand gcd and gcdext. But I
> might be wrong.
Hmm. I'd reason like this: all three functions are expected to be
quadratic for small sizes. Then mul should become faster relative to
the others above the Karatsuba threshold, since the gcd thresholds are
much higher than the Karatsuba threshold.
But the above numbers don't look like that at all, which might indicate
that the linear term of the complexity is *much* higher for gcd than for
mul.
Do I read the data the same way as you?
> Would asm "binary algorithm" code also make sense for, say, n = 10? I
> have some doubt, as C code around the current (asm) primitives add_n,
> sub_n, rshift probably is close to optimal.
I'd guess that assembly code for small sizes, which keeps the operands
in registers, could bring some speedup. One could start with just n = 2,
to see what the gain is.
But before going beyond n = 2, I think we should consider an assembly
implementation of hgcd2. That's the main linear part of the gcd
complexity.
> I had an idea now for using a Hensel norm Euclidean algorithm. Finding
> a^(-1) mod 2^b where b is the limb size in the inner loop would be
> expensive, but how about letting b = 8 or thereabout? Then a table
> lookup would work for the inverses. Does that make sense?
Say we compute a k-bit Hensel quotient, and replace (a, b) by
((a + q b)/2^k, b). Then we get no additional progress by making k
larger than the bit-size difference between a and b; making q b much
larger than a is useless.
So just like the plain Euclidean gcd, useful quotients are pretty small
on average. To get more progress than one or two bits per step, one
needs either a matrix (representing multiple division steps), or a more
general linear combination a <-- k a + m b (like the "accelerated gcd"
by Ken Weber, and with some need to handle spurious factors).
One can gain one bit by using signed values, though, choosing q so that
|q| < 2^{k-1}. Putting k = 3 (as you suggest) and |q| < 4 might give a
speedup over plain binary, if it can be implemented with branches kept
to a minimum.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.