Revisiting gcd and gcdext for small operands
Niels Möller
nisse at lysator.liu.se
Fri Dec 1 15:00:31 UTC 2017
tg at gmplib.org (Torbjörn Granlund) writes:
> Nisse and other great common dividers!
>
> Please consider these measurements:
>
> ibr32$ ./tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
> mpn_mul mpn_gcd mpn_gcdext
> 1 #39.14 6.5512 17.1607
> 2 #50.77 12.4422 51.4993
> 3 #66.27 36.0216 57.9561
> 4 #87.17 40.7439 58.6398
> 5 #110.95 41.7779 57.3457
> 6 #136.56 42.4274 54.8383
> 7 #165.67 42.7125 51.9271
> 8 #205.14 39.7861 47.5393
> 9 #249.47 37.0144 42.8386
> 10 #289.77 36.0182 40.6772
> 12 #398.58 32.2387 34.2015
> 14 #516.54 28.4706 28.4355
> 16 #653.70 25.3546 25.7443
> 19 #887.40 21.4907 20.0516
> 22 #1196.18 17.9833 17.4368
> 26 #1617.71 15.1357 15.3959
> 31 #2196.36 12.3736 13.3466
> 37 #2963.45 9.1705 12.3230
> 44 #3970.82 9.4049 11.9397
> 52 #5309.94 7.8460 11.6024
>
> ibr64# tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
> mpn_mul mpn_gcd mpn_gcdext
> 1 #34.51 9.5965 43.4851
> 2 #41.46 24.7985 91.1526
> 3 #63.37 60.3494 89.9912
> 4 #80.39 69.5938 92.7650
> 5 #105.76 69.8544 88.4005
> 6 #134.70 68.6191 83.3540
> 7 #173.62 64.7691 74.9246
> 8 #208.92 62.9363 72.0671
> 9 #262.49 56.4922 63.4748
> 10 #314.04 52.3732 59.2200
> 12 #437.50 46.1083 50.0230
> 14 #578.39 41.1333 42.6137
> 16 #714.52 38.2636 38.8416
> 19 #1021.12 32.3526 30.0309
> 22 #1274.16 29.6524 28.3373
> 26 #1653.53 26.4316 25.1063
> 31 #2283.67 22.3448 20.5088
> 37 #3185.51 16.2611 14.3919
> 44 #4003.10 16.8047 15.6203
> 52 #5287.91 12.4089 16.1354
>
> Without detail knowledge of the current implementation, these numbers
> suggest to me that we could speed small operand gcd and gcdext. But I
> might be wrong.
Hmm. I'd reason like this: all three functions are expected to be
quadratic for small sizes. Then mul should become faster relative to
the others above the Karatsuba threshold, since the gcd thresholds are
much higher than the Karatsuba threshold.
But the above numbers don't look like that at all, which might indicate
that the linear term of the complexity is *much* higher for gcd than for
mul.
Do I read the data the same way as you?
> Would asm "binary algorithm" code also make sense for, say, n = 10? I
> have some doubt, as C code around the current (asm) primitives add_n,
> sub_n, rshift probably is close to optimal.
I'd guess that assembly code for small sizes, which keeps the operands
in registers, could bring some speedup. One could start with just n = 2,
to see what the gain is.
But before going beyond n = 2, I think we should consider an assembly
implementation of hgcd2. That's the main linear part of the gcd
complexity.
> I had an idea now for using a Hensel norm Euclidean algorithm. Finding
> a^(-1) mod 2^b where b is the limb size in the inner loop would be
> expensive, but how about letting b = 8 or thereabout? Then a table
> lookup would work for the inverses. Does that make sense?
Say we compute a k-bit Hensel quotient, and replace (a, b) by
((a + q b)/2^k, b). Then we get no additional progress by making k
larger than the bit-size difference between a and b; making q b much
larger than a is useless.
So just like the plain Euclidean gcd, useful quotients are pretty small
on average. To get more progress than one or two bits per step, one
needs either a matrix (representing multiple division steps), or a more
general linear combination a <-- k a + m b (like the "accelerated gcd"
by Ken Weber, and with some need to handle spurious factors).
One can gain one bit by using signed values, though, choosing q so that
|q| < 2^{k-1}. Putting k = 3 (as you suggest) and |q| < 4 might give a
speedup over plain binary, if it can be implemented with branches kept
to a minimum.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.