Cofactor canonicalisation of mpn_gcdext

Torbjorn Granlund tg at
Thu Dec 3 19:10:23 CET 2009

nisse at (Niels Möller) writes:

  I gave div1 a try, and it actually seemed to make gcdext slower (./speed
  -c -s1 mpn_gcdext on my x86_32 laptop).
32-bit machines typically perform division faster than 64-bit machines.
And in the case of x86, there is a more severe shortage of registers in
the 32-bit instruction set than in the 64-bit instruction set; that will
hurt div1.

  > If you think this code is right, don't be shy: please push it into the
  > main repo and check tomorrow.
  Pushed in now.


  Benchmarking on shell, I get (for the old gcdext code):
  $ ./speed -c -s1-5 mpn_gcdext
  overhead 6.06 cycles, precision 10000 units of 4.66e-11 secs, CPU freq 21438.42 MHz
  1             2337.48
  2             4782.41
  3             7249.04
  4             9705.38
  5            12155.41
  and for the just pushed in code, with one /-division per iteration in gcdext_1, I
  1             1585.38
  2             4659.20
  3             7091.04
  4             9503.12
  5            11924.64
So, on shell we get a speedup.  Cute.

I bet that we do not get a speedup on repentium or panther or tiger.
Only 45nm Core 2 and Core i have early-out dividers, I think.  An
early-out divider will help here.

Small-operand gcd and gcdext are interesting and important things to
optimise.  We might define different approaches, and let tuneup choose
the best one.


More information about the gmp-devel mailing list