Improved gcd_1 code

Tue Mar 13 09:58:47 CET 2012

Torbjorn Granlund <tg at gmplib.org> writes:

> I pushed some new gcd_1 code for x86_64, helping AMD k10 and bulldozer,
> Intel conroe/penryn, nehalem/westmere, and sandybridge, and VIA nano.

What algorithm do you use?

> Before counting the # of trailing zero bits, we need to wait for two
> cmove to complete.  On Intel, cmove is a serial instruction with a 2
> cycle latency.  It could be possible to compute ctz(a-b) instead of
> ctz(|a-b|) in order to run bsf an cmove in parallel.

The C code (for GCD_1_METHOD == 2) does something like that. But with
masking tricks instead of cmov, of course.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.