Improved gcd_1 code

Torbjorn Granlund tg at
Tue Mar 13 20:55:28 CET 2012

nisse at (Niels Möller) writes:

  Torbjorn Granlund <tg at> writes:
  > A properly sized table will beat any count_trailing_zeros which takes
  > more than about 5 cycles.
  Ok. I take it this is still in in the gcd context, where the count is
  expected to be small?

  Not useful in the gcd context (where the count is expected to be small),
  possibly never.
I suspect something like this is better:

   c = 0
   if ((x & 0xffffffff == 0)
     c = 32;
   x >>= c;
   sum = c;

   c = 0
   if ((x & 0xffff == 0)
     c = 16;
   x >>= c;
   sum += c;

   c = 0
   if ((x & 0xff == 0)
     c = 8;
   x >>= c;
   sum += c;


  >   > a count_trailing_zeros_gcd should be added which is to be optimised
  >   > for small counts.
  By the way, are there any uses of count_trailing_zeros where the result
  is *not* expected to be small, for random inputs to the calling
  function? Maybe gcd is not so special?
No idea.  Try: grep -rl count_trailing_zeros gmpdir

I pushed a slightly improved gcd_1 with cmov and bsf now running in
parallel.  All Intel processors benefit, since their cmov have a
throughput of 1/3 to 1/6 compared to K10, and a teice as long latency.
AMD processors seem unaffacted.

I am sure te code can be further improved with some scheduling.  It now
relies on out-of-order execution more than needed.


More information about the gmp-devel mailing list