Schönhage's algorithm for fast integer GCD
   
    Niels Möller
     
    nisse@lysator.liu.se
       
    03 Jul 2003 14:38:27 +0200
    
    
  
Now I've rewritten the code to uses mpn. It seems to works, and it's
quite clear that it's asymptotically faster. I hope it is possible to
increase the speed at least by a factor of two by optimizing two or
three well chosen functions. Below is a benchmark of the current code.
It was run with thresholds in the range 32-64 limbs, running on x86,
and using fibonacci numbers (which are worst case for Euclid's
algorithm with maximum number of euclid steps, and all quotients = 1).
$ ./bench
Gcd of fibonacci numbers.
limbs  gcd  fgcd  gcd/fgcd
    1  0.71us  0.82us  0.87              *
    1  0.68us  0.81us  0.84              *
    1  0.91us  0.99us  0.92              *
    1  0.76us  0.93us  0.82              *
    2  1.02us  1.05us  0.97              *
    2  1.28us  1.39us  0.92              *
    3  2.06us  2.00us  1.03               *
    4  3.81us  3.81us  1.00              *
    6  7.16us  7.26us  0.99              *
    8  7.97us  8.98us  0.89              *
   12 14.99us 13.66us  1.10               *
   18 20.85us 21.49us  0.97              *
   27 35.24us 34.20us  1.03               *
   41 59.59us 59.00us  1.01               *
   61 97.75us 96.71us  1.01               *
   91 186.22us 591.72us  0.31         *
  136 321.54us  1.35ms  0.24        *
  204 584.80us  2.00ms  0.29         *
  305  1.11ms  4.00ms  0.28         *
  458  2.22ms  5.88ms  0.38          *
  686  4.00ms 11.11ms  0.36          *
 1029  8.33ms 20.00ms  0.42           *
 1544 18.33ms 33.33ms  0.55            *
 2316 36.67ms 60.00ms  0.61            *
 3473 90.00ms 100.00ms  0.90              *
 5209 190.00ms 180.00ms  1.06               *
 7814 420.00ms 300.00ms  1.40                *
11720 950.00ms 530.00ms  1.79                 *
17580  2.12s 950.00ms  2.23                  *
26371  4.80s  1.67s  2.87                   *
39556 10.82s  2.98s  3.63                    *
59333 28.40s  5.16s  5.50                      *
gprof output:
 43.37      0.85     0.85   195586     0.00     0.00  hgcd_2
 13.52      1.11     0.27  2038248     0.00     0.00  hgcd_update_1_ui
 12.24      1.35     0.24  2075205     0.00     0.00  euclid_step
  7.40      1.50     0.14   195586     0.00     0.00  hgcd_2_update
  3.57      1.57     0.07       32     2.19    60.78  bench_fgcd
  2.55      1.62     0.05   782344     0.00     0.00  hgcd_2_abs
[...]
:    0.00      1.96     0.00       32     0.00     0.00  display
Here, hgcd_2 is the work horse for Lehmer steps, and euclid_step
together with hgcd_update_1_ui is the work horse for euclid steps. Two
thirds of the cpu time is spent in those three functions.
The code is available at
http://www.lysator.liu.se/~nisse/misc/gcd-0.3.tar.gz. It needs some
more of both cleanup and optimization.
Regards,
/Niels