Schönhage's algorithm for fast integer GCD
Niels Möller
nisse@lysator.liu.se
15 Aug 2003 16:48:49 +0200
nisse@lysator.liu.se (Niels Möller) writes:
> Now I've rewritten the code to use mpn. It seems to work, and it's
> quite clear that it's asymptotically faster.
I've rewritten the code once more, after rereading and finally (I
hope) understanding Jebelean's paper. It's cleaner and a little faster
than the previous version.
Unfortunately, it needs quite a lot of temporary storage: for n-limb
numbers, somewhere between 30n and 100n limbs. The reason is that at
any given time, the algorithm stores only a short subsequence, or
window, of the complete remainder sequence. It moves this window
downwards (towards smaller remainders), but it sometimes needs to back
up and reconstruct an earlier, larger remainder.
The easiest way to do that is to store the quotient sequence (which
needs O(n) storage). Given the (usually small) quotient, the previous
remainder can be reconstructed in O(n) operations.
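
To make that backup step concrete, here's a minimal sketch of the
reconstruction, written with mpz for readability rather than the mpn
interface the actual code uses, and with made-up names. It just uses
the relation r[k-1] = q[k]*r[k] + r[k+1]:

#include <gmp.h>

/* Sketch only: reconstruct r[k-1] from the stored quotient q[k] and
   the two following remainders r[k], r[k+1]. With a small (e.g.
   single-limb) quotient this costs O(n) limb operations. */
static void
backup_remainder (mpz_t r_prev, const mpz_t q,
                  const mpz_t r, const mpz_t r_next)
{
  mpz_mul (r_prev, q, r);             /* q[k] * r[k] */
  mpz_add (r_prev, r_prev, r_next);   /* + r[k+1]    */
}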
I've also implemented a Jebelean/Lehmer gcd algorithm, which I use for
the base case. It seems to be slightly slower than the binary gcd
implementation in the current GMP, at least on x86. In any case, this
makes it clear how Schönhage differs from Lehmer: in Lehmer
(Jebelean's variant) we chop off and work with the two most
significant limbs, while for Schönhage we chop off half of the limbs.
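
To illustrate the two kinds of chopping, here is a small sketch with
made-up names (not from the actual code): a Lehmer step approximates
an n-limb operand by its leading double limb, while a Schönhage step
recurses on roughly its leading n/2 limbs.

#include <gmp.h>

/* Lehmer/Jebelean: approximate {ap, n} by its two most significant limbs. */
static void
lehmer_leading (const mp_limb_t *ap, mp_size_t n,
                mp_limb_t *hi, mp_limb_t *lo)
{
  *hi = ap[n - 1];
  *lo = n >= 2 ? ap[n - 2] : 0;
}

/* Schönhage: recurse on the most significant ~n/2 limbs, i.e. the part
   above the split point p; returns the size of the leading part. */
static mp_size_t
schoenhage_leading (const mp_limb_t *ap, mp_size_t n,
                    const mp_limb_t **leading)
{
  mp_size_t p = n / 2;      /* low limbs ignored at this level */
  *leading = ap + p;
  return n - p;
}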
Timings, with HGCD_LEHMER_THRESHOLD = FGCD_LEHMER_THRESHOLD = 128:
$ ./bench
Gcd of fibonacci numbers.
  limbs       gcd      fgcd  gcd/fgcd
      1    0.63us    0.83us      0.76 *
      1    0.74us    0.85us      0.87 *
      1    0.80us    0.89us      0.91 *
      1    0.89us    0.89us      0.99 *
      2    0.92us    1.14us      0.81 *
      2    1.41us    2.44us      0.58 *
      3    2.03us    4.27us      0.48 *
      4    3.74us    6.62us      0.57 *
      6    6.73us   10.68us      0.63 *
      8    7.63us   16.43us      0.46 *
     12   13.07us   24.95us      0.52 *
     18   20.73us   38.74us      0.53 *
     27   35.92us   58.07us      0.62 *
     41   62.93us   90.83us      0.69 *
     61  103.09us  150.60us      0.68 *
     91  169.78us  229.36us      0.74 *
    136  320.51us  523.56us      0.61 *
    204  540.54us  917.43us      0.59 *
    305    1.09ms    1.85ms      0.59 *
    458    2.22ms    3.12ms      0.71 *
    686    4.35ms    6.25ms      0.70 *
   1029    8.33ms   11.11ms      0.75 *
   1544   18.33ms   20.00ms      0.92 *
   2316   40.00ms   40.00ms      1.00 *
   3473   85.00ms   70.00ms      1.21 *
   5209  200.00ms  130.00ms      1.54 *
   7814  420.00ms  240.00ms      1.75 *
  11720  950.00ms  420.00ms      2.26 *
  17580     2.13s  780.00ms      2.73 *
  26371     4.81s     1.42s      3.39 *
  39556    11.10s     2.60s      4.27 *
  59333    28.82s     4.61s      6.25 *
  88999    81.95s     8.21s      9.98 *
The break-even point is around 2000 limbs, and fgcd wins by a factor
of ten for numbers larger than about 100000 limbs.
gprof top ten:
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
32.45      0.68      0.68  21887061     0.00     0.00  div2
24.04      1.18      0.50    512268     0.00     0.00  hgcd2
 9.13      1.36      0.19    406712     0.00     0.00  mpn_gcd_jebelean
 6.73      1.50      0.14  14188759     0.00     0.00  mpn_quotient_stack_push_1
 5.77      1.62      0.12   1024536     0.00     0.00  mpn_addsubmul_1
 4.57      1.72      0.10   1287284     0.00     0.00  mpn_addmul2_1
 3.85      1.80      0.08    959009     0.00     0.00  jhgcd2_check_q
 3.37      1.87      0.07    321821     0.00     0.00  mpn_jhgcd_mul_1
Note that div2 and hgcd2 are the real workhorses. Unlike the previous
version, almost no time is spent doing Euclid steps.
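
To give a rough idea of what that inner loop looks like, here's a
sketch (not the actual div2/hgcd2 code) of Euclid on the double-limb
leading parts, accumulating the quotients in a 2x2 matrix. It assumes
64-bit limbs and a compiler providing unsigned __int128, and its
stopping condition is much cruder than the exactness test (Jebelean's
criterion) that the real code needs:

#include <stdint.h>

typedef unsigned __int128 u128;

struct matrix22 { uint64_t u[2][2]; };

/* Sketch: Euclid on double-limb leading parts a, b, accumulating the
   quotient matrix M so that (a0;b0) = M (a;b). The real hgcd2 must
   also check that each quotient is valid for the full numbers and
   that the matrix entries stay within one limb. */
static struct matrix22
hgcd2_sketch (u128 a, u128 b)
{
  struct matrix22 m = {{{1, 0}, {0, 1}}};

  while (b != 0 && (a >> 64) != 0)   /* crude stop: a still two limbs */
    {
      u128 q = a / b;                /* the "div2" step */
      u128 r = a - q * b;

      /* M <- M * [[q,1],[1,0]]; assumes q and the new entries fit in
         one limb, which the real code guarantees before accepting q. */
      uint64_t t0 = (uint64_t) q * m.u[0][0] + m.u[0][1];
      uint64_t t1 = (uint64_t) q * m.u[1][0] + m.u[1][1];
      m.u[0][1] = m.u[0][0]; m.u[0][0] = t0;
      m.u[1][1] = m.u[1][0]; m.u[1][0] = t1;

      a = b;
      b = r;
    }
  return m;
}

The point is that all the divisions happen on fixed-size, two-limb
quantities, which is why div2 and hgcd2 dominate the profile.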
I've put the code at
http://www.lysator.liu.se/~nisse/misc/gcd-0.3.tar.gz. The archive is a
little messy, as it includes several variants of the Schönhage
algorithm. The current code is in the file jebelean_mpn.c.
Some further work:
* Find a bound on the number of backup steps, and don't store any
more quotients than that.
* Figure out if it is possible to use Schönhage's and Jebelean's
ideas together with a base case that uses the binary gcdext
algorithm.
Regards,
/Niels