Cofactor canonicalisation of mpn_gcdext

Thu Dec 3 19:10:23 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

  I gave div1 a try, and it actually seemed to make gcdext slower (./speed
  -c -s1 mpn_gcdext on my x86_32 laptop).

32-bit machines typically perform division faster than 64-bit machines.
And in the case of x86, there is a more severe shortage of registers in
the 32-bit instruction set than in the 64-bit instruction set; that will
hurt div1.

  > If you think this code is right, don't be shy: please push it into the
  > main repo and check http://gmplib.org/devel/testmachines.shtml tomorrow.

  Pushed in now.

Thanks!

  Benchmarking on shell, I get (for the old gcdext code):

  $ ./speed -c -s1-5 mpn_gcdext
  overhead 6.06 cycles, precision 10000 units of 4.66e-11 secs, CPU freq 21438.42 MHz
             mpn_gcdext
  1             2337.48
  2             4782.41
  3             7249.04
  4             9705.38
  5            12155.41

  and for the just pushed in code, with one /-division per iteration in gcdext_1, I
  get

  1             1585.38
  2             4659.20
  3             7091.04
  4             9503.12
  5            11924.64

So, on shell we get a speedup.  Cute.

I bet that we do not get a speedup on repentium or panther or tiger.
Only 45nm Core 2 and Core i have early-out dividers, I think.  An
early-out divider will help here.

Small-operand gcd and gcdext are interesting and important things to
optimise.  We might define different approaches, and let tuneup choose
the best one.

-- 
Torbjörn