Improved gcd_1 code

Mon Mar 12 23:17:15 CET 2012

I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.

The actual code lives in x86_64/core2, other CPUs either inherit it or
grab it from there.

Additional improvements are welcome.

One idea:

Before counting the number of trailing zero bits, we need to wait for
two cmove instructions to complete.  On Intel, cmove is a serial
instruction with a 2-cycle latency.  It might be possible to compute
ctz(a-b) instead of ctz(|a-b|), so that bsf and cmove can run in
parallel.

This should work, as |a-b| and a-b always have the same number of
trailing zero bits: negating a two's complement number leaves its
lowest set bit in place.
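
A minimal C sketch of the idea, for illustration only.  It is not the
pushed assembly code, and the names gcd_odd and ctz64 are hypothetical.
It assumes both operands are odd, and uses a plain loop as a portable
stand-in for bsf.  The point is that the trailing zero count is taken
from the raw, possibly wrapped, difference, so it depends only on the
subtraction and not on the two conditional selects.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zero bits of a nonzero value (stand-in for bsf). */
static unsigned ctz64(uint64_t x)
{
    unsigned n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* gcd of two odd values; hypothetical helper, not the GMP entry point. */
static uint64_t gcd_odd(uint64_t a, uint64_t b)
{
    assert((a & 1) && (b & 1));
    while (a != b) {
        uint64_t d = a - b;                /* wraps modulo 2^64 if a < b */
        unsigned cnt = ctz64(d);           /* needs only d, not |a-b| */
        uint64_t absd = a > b ? d : b - a; /* |a-b|, first select */
        b = a < b ? a : b;                 /* min(a,b), second select */
        a = absd >> cnt;                   /* odd again after the shift */
    }
    return a;
}
```

Here cnt has no data dependency on either select, so a compiler or a
hand-written loop is free to overlap the bit scan with the cmoves.
Whether that pays off on a given CPU would have to be measured.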

-- 
Torbjörn