Schönhage's algorithm for fast integer GCD

Niels Möller nisse@lysator.liu.se
15 Aug 2003 16:48:49 +0200


nisse@lysator.liu.se (Niels Möller) writes:

> Now I've rewritten the code to uses mpn. It seems to works, and it's
> quite clear that it's asymptotically faster.

I've rewritten the code once more, after rereading and finally (I
hope) understanding Jebelean's paper. It's cleaner and a little faster
than the previous version.

Unfortunately, it needs quite a lot of temporary storage, for n-limb
numbers it needs somewhere between 30n -- 100n limbs of temporary
storage. The reason is that at any given time, the algorithm stores
only a short subsequence or window of the complete remainder sequence.
It moves this window downwards (smaller remainders), but it sometimes
needs to backup and reconstruct an earlier, larger, remainder.

The easist way to do that is to store the quotient sequence (needs
O(n) storage). Given the (usually small) quotient, the previous
remainder can be reconstructed in O(n) operations.

I've also implemented a Jebelean/Lehmer gcd algorithm, which I use for
the base case. It seems to be slightly slower than the binary gcd
implementation in the current GMP, at least on x86. At least this
makes it clear how Schönhage differs from Lehmer: In Lehmer
(Jebelean's variant) we chop off the two most significant limbs, while
for Schönhage we chop off half of the limbs.

Timings, with HGCD_LEHMER_THRESHOLD = FGCD_LEHMER_THRESHOLD = 128:

  $ ./bench
  Gcd of fibonacci numbers.
  
  limbs    gcd     fgcd    gcd/fgcd
      1   0.63us   0.83us  0.76             *
      1   0.74us   0.85us  0.87              *
      1   0.80us   0.89us  0.91              *
      1   0.89us   0.89us  0.99              *
      2   0.92us   1.14us  0.81              *
      2   1.41us   2.44us  0.58            *
      3   2.03us   4.27us  0.48           *
      4   3.74us   6.62us  0.57            *
      6   6.73us  10.68us  0.63             *
      8   7.63us  16.43us  0.46           *
     12  13.07us  24.95us  0.52            *
     18  20.73us  38.74us  0.53            *
     27  35.92us  58.07us  0.62            *
     41  62.93us  90.83us  0.69             *
     61 103.09us 150.60us  0.68             *
     91 169.78us 229.36us  0.74             *
    136 320.51us 523.56us  0.61            *
    204 540.54us 917.43us  0.59            *
    305   1.09ms   1.85ms  0.59            *
    458   2.22ms   3.12ms  0.71             *
    686   4.35ms   6.25ms  0.70             *
   1029   8.33ms  11.11ms  0.75             *
   1544  18.33ms  20.00ms  0.92              *
   2316  40.00ms  40.00ms  1.00               *
   3473  85.00ms  70.00ms  1.21               *
   5209 200.00ms 130.00ms  1.54                *
   7814 420.00ms 240.00ms  1.75                 *
  11720 950.00ms 420.00ms  2.26                  *
  17580   2.13s  780.00ms  2.73                   *
  26371   4.81s    1.42s   3.39                    *
  39556  11.10s    2.60s   4.27                     *
  59333  28.82s    4.61s   6.25                      *
  88999  81.95s    8.21s   9.98                        *

The break even point is around 2000 limbs, and it wins by a factor of
ten for numbers larger than about 100000 limbs.

gprof top ten:

   %   cumulative   self              self     total           
  time   seconds   seconds    calls  ms/call  ms/call  name    
  32.45      0.68     0.68 21887061     0.00     0.00  div2
  24.04      1.18     0.50   512268     0.00     0.00  hgcd2
   9.13      1.36     0.19   406712     0.00     0.00  mpn_gcd_jebelean
   6.73      1.50     0.14 14188759     0.00     0.00  mpn_quotient_stack_push_1
   5.77      1.62     0.12  1024536     0.00     0.00  mpn_addsubmul_1
   4.57      1.72     0.10  1287284     0.00     0.00  mpn_addmul2_1
   3.85      1.80     0.08   959009     0.00     0.00  jhgcd2_check_q
   3.37      1.87     0.07   321821     0.00     0.00  mpn_jhgcd_mul_1

Note that div2 and hgcd2 are the real workhorses. Unlike the previos
version, almost no time is spent doing euclid steps.

I've put the code at
http://www.lysator.liu.se/~nisse/misc/gcd-0.3.tar.gz. The archive is a
little messy, as it includes several variants of the schönhage
algorithm. The current code is in the file jebelean_mpn.c.

Some further work:

 * Find a bound on the number of backup steps, and don't store any
   more quotients than that.

 * Figure out if it is possible to use Schönhage's and Jebelean's
   ideas together with a basecase that uses the binary gcdext
   algorithm. 

Regards,
/Niels