Likely GMP bug

Torbjörn Granlund tg at gmplib.org
Wed May 30 13:41:37 UTC 2018


nisse at lysator.liu.se (Niels Möller) writes:

  But you may be right that it's important for performance to avoid a
  redundant count_trailing_zeros on u.

It is slow on several machines, but it has become more common to provide
good leading/trailing bit count for newer microarchs.

  It seems tricky to avoid that, without code duplication or complications
  in the code flow. For my personal preference, tradeoff threshold between
  goto and duplicated code is around 5-10 duplicated code lines.

  Thinking about micro optimizations... Consider

  >       count_trailing_zeros (c, ulimb);
  >       ulimb = (ulimb >> 1) >> c;

  vs

  >       count_trailing_zeros (c, ulimb);
  >       ulimb >>= (c + 1);

Right shift has OK performance almost everywhere.  (And obsolete
exceptions is x86-64 Pentium 4 where it takes some 8 cycles and stalls
the pipeline.)

On newer Intel x86-64 shifting is fast for immediate counts but takes 2
cycles with 1/2 thoughput for dynamic counts.  The latest generations
provides a new set of shifts for dynamic counts which are fast,
e.g. shrx.

AMD chips have great right shift performance with no exceptions.

No RISCs have any problems here.

  I wonder if maybe it's preferable to always use the first variant? Both
  should compile to three instructions (assuming we have a ctz
  instruction), ctz + shift + shift, or ctz + add + shift. But with the
  first variant, the first shift can be issued in parallel with ctz,
  reducing the critical path of the gcd loop.

  We'd need some benchmarking to see if it makes a difference.

Suppressing redundant count_trailing_zeros should make more difference.

Ref: https://gmplib.org/~tege/x86-timing.pdf

-- 
Torbjörn
Please encrypt, key id 0xC8601622


More information about the gmp-bugs mailing list