Improved gcd_1 code

Tue Mar 13 10:25:51 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
  > Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.

  What algorithm do you use?

Plain binary.  It is similar to the old code (which traces back to Kevin
Ryde's k7 code) except that it does not use any tables or tight loops
for dealing with trailing zeros.

  > Before counting the # of trailing zero bits, we need to wait for two
  > cmove to complete.  On Intel, cmove is a serial instruction with a 2
  > cycle latency.  It could be possible to compute ctz(a-b) instead of
  > ctz(|a-b|) in order to run bsf an cmove in parallel.

  The C code (for GCD_1_METHOD == 2) does something like that. But with
  masking tricks instead of cmov, of course.

Looks promising.  GCD_1_METHOD seems to be read-only, but method 2 is
the default.

I think a 16 byte zerotab is too small, and should be expanded.  (But if
a tiny tab is to be used, I suppose extraction from a magic limb
constant is better.)  Well, zerotab is unconditionally disabled now and
a loop is used.

I suppose we should make use of longlong for trailing zeros counting,
since both looping (as done now) and suitably sized tables are slower on
many machines that dedicated insns.  But count_trailing_zeros is
unsuitable in its current form, a count_trailing_zeros_gcd should be
added which is to be optimised for small counts.

-- 
Torbjörn