Revisiting gcd and gcdext for small operands
Torbjörn Granlund
tg at gmplib.org
Fri Dec 1 13:25:22 UTC 2017
Nisse and other great common dividers!
Please consider these measurements:
ibr32$ ./tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
mpn_mul mpn_gcd mpn_gcdext
1 #39.14 6.5512 17.1607
2 #50.77 12.4422 51.4993
3 #66.27 36.0216 57.9561
4 #87.17 40.7439 58.6398
5 #110.95 41.7779 57.3457
6 #136.56 42.4274 54.8383
7 #165.67 42.7125 51.9271
8 #205.14 39.7861 47.5393
9 #249.47 37.0144 42.8386
10 #289.77 36.0182 40.6772
12 #398.58 32.2387 34.2015
14 #516.54 28.4706 28.4355
16 #653.70 25.3546 25.7443
19 #887.40 21.4907 20.0516
22 #1196.18 17.9833 17.4368
26 #1617.71 15.1357 15.3959
31 #2196.36 12.3736 13.3466
37 #2963.45 9.1705 12.3230
44 #3970.82 9.4049 11.9397
52 #5309.94 7.8460 11.6024
ibr64# tune/speed -p10000000 -c -s1-60 -f1.2 -r mpn_mul mpn_gcd mpn_gcdext
mpn_mul mpn_gcd mpn_gcdext
1 #34.51 9.5965 43.4851
2 #41.46 24.7985 91.1526
3 #63.37 60.3494 89.9912
4 #80.39 69.5938 92.7650
5 #105.76 69.8544 88.4005
6 #134.70 68.6191 83.3540
7 #173.62 64.7691 74.9246
8 #208.92 62.9363 72.0671
9 #262.49 56.4922 63.4748
10 #314.04 52.3732 59.2200
12 #437.50 46.1083 50.0230
14 #578.39 41.1333 42.6137
16 #714.52 38.2636 38.8416
19 #1021.12 32.3526 30.0309
22 #1274.16 29.6524 28.3373
26 #1653.53 26.4316 25.1063
31 #2283.67 22.3448 20.5088
37 #3185.51 16.2611 14.3919
44 #4003.10 16.8047 15.6203
52 #5287.91 12.4089 16.1354
Without detail knowledge of the current implementation, these numbers
suggest to me that we could speed small operand gcd and gcdext. But I
might be wrong.
The relative slowness of gcd/gcdext vs mul is worse for 64-bit limbs
(2nd table) than for 32-bit limbs (1st table).
I assume that assembly basecase code using the "binary algorithm" would
help at least for n = 2 and up to some limit. Such code exists now just
for n = 1, IIRC (and that code, mpn_gcd_1, allows one operand of
arbitrary size, for historical reasons).
Would asm "binary algorithm" code also make sense for, say, n = 10? I
have some doubt, as C code around the current (asm) primitives add_n,
sub_n, rshift probably is close to optimal. Sure, one could pwerhaps
use straight-line code for add_n/sub_n/rshift for some values of n, and
that could perhaps give a 2x speedup, at the expense of lot of asm code.
I had an idea now for using a Hensel norm Euclidean algorithm. Finding
a^(-1) mod 2^b where b is the limb size in the inner loop would be
expensive, but how about letting b = 8 or thereabout? Then a table
lookup would work for the inverses. Does that make sense?
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list