Torbjörn Granlund tg at gmplib.org
Fri Aug 23 09:13:06 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  The below implementation appears to pass tests, and give a modest
  speedup of 0.2 cycles per input bit, or 2.5%. Benchmark, comparing C
  implementations of gcd_11 and gcd_22:

Beware of "turbo" when counting cycles!  (Relative measurements like
gcd_11 vs gcd_22 should be fine!)

The speed difference between C gcd_11 and gcd_22 is surprisingly small!
Perhaps gcd_11 should be rewritten in the style of gcd_22?
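
For readers who have not looked at these routines: gcd_11 works on two
single-limb odd operands, gcd_22 on two-limb operands. The following is
only a minimal sketch of a single-limb binary gcd loop in C, to show the
kind of inner loop being compared. It is not the GMP source. The name
gcd_11_sketch is hypothetical, a 64-bit limb is assumed, and
__builtin_ctzll is a GCC/Clang builtin rather than standard C.

```c
#include <stdint.h>

/* Illustrative stand-in only, NOT GMP's mpn_gcd_11.
   Assumption: both u and v are odd (hence nonzero), limbs are 64 bits. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
    while (u != v) {
        /* |u - v| is even and nonzero because u and v are odd and differ. */
        uint64_t t = u > v ? u - v : v - u;
        /* Keep the smaller operand. */
        v = u < v ? u : v;
        /* Strip trailing zero bits so the new u is odd again. */
        u = t >> __builtin_ctzll(t);
    }
    return u;
}
```

Each iteration removes at least one bit from the larger operand, which is
why the cost of such loops is naturally quoted in cycles per input bit.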

I did not provide a top-level gcd_22 for x86_64 as you might have seen.
The one similar to x86_64/gcd_11.asm is probably x86_64/k8/gcd_22.asm.
Perhaps it should be moved.

But as far as I can tell, that function is slower than your C gcd_22 on
some platforms, such as Intel Haswell.

I'm curious if your C code could be made into competitive asm.  One
can usually beat the compiler by some 10-30%.

Measurements for gcd_11/22 for most of our machines are in.  See
https://gmplib.org/devel/tm/gmp/date.html and click on any HOSTgentoo64
tuneup link.  Scroll down; after the normal *_THRESHOLD stuff come
comparative measurements of asm code.  (The mpn/generic code is not
usually measured; the exception is when it appears in the default
column.  I plan to fix this some day, and have a few columns "gcc -O",
"gcc -Os", "gcc -O2".)

Please encrypt, key id 0xC8601622
