Improved gcd_1 code
tg at gmplib.org
Mon Mar 12 23:17:15 CET 2012
I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.
The actual code lives in x86_64/core2; other CPUs either inherit it or
grab it from there.
Additional improvements are welcome.
Before counting the number of trailing zero bits, we need to wait for
two cmove instructions to complete. On Intel, cmove is a serial
instruction with a 2 cycle latency. It might be possible to compute
ctz(a-b) instead of ctz(|a-b|) in order to run bsf and cmove in
parallel.
This should work, as |a-b| and a-b always have the same number of
trailing zero bits: two's complement negation leaves the lowest set
bit where it is.
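
To make the idea concrete, here is a rough C sketch of one way the loop
could be arranged. It is not the pushed assembly code: gcd_odd is a
hypothetical helper name, it assumes both inputs are odd and nonzero,
and it relies on the GCC/Clang builtin __builtin_ctzll. The point is
only that the bit count depends on a-b alone, so it does not have to
wait for the two conditional selects.

```c
#include <stdint.h>

/* Hypothetical helper, not GMP's mpn_gcd_1: binary gcd of two odd,
   nonzero 64-bit values.  Trailing zeros are counted on the wrapped
   difference a-b rather than on |a-b|. */
static uint64_t gcd_odd(uint64_t a, uint64_t b)
{
    while (a != b) {
        uint64_t d = a - b;                  /* wraps mod 2^64 if a < b */
        int cnt = __builtin_ctzll(d);        /* needs only d, not |d| */
        uint64_t ad = (a > b) ? d : (0 - d); /* |a-b|, a cmov candidate */
        b = (a > b) ? b : a;                 /* min(a,b), a cmov candidate */
        a = ad >> cnt;                       /* strip the zeros: odd again */
    }
    return a;
}
```

Since a and b are both odd and differ inside the loop, d is even and
nonzero, so the count is always well defined. Whether a compiler
actually emits cmov for the two selects, and whether this ordering wins
on a given CPU, would have to be measured.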