gcd_11 without abs

Sat Nov 21 15:54:17 UTC 2020

Torbjörn Granlund <tg at gmplib.org> writes:

> The cycle counts are per bit eliminated, it is not cycle counts.

Ok, so they can't be compared directly with "expected" cycles per
iteration.

> I never bothered to try to measure cycle counts.  One would need to
> measure with specially selected operands with known behaviour for the
> algorothm, but that would be fragile.

Or let the benchmark program also call a reference implementation that
produces the iteration count. Cycles per iteration would be helpful in
this context.

>   I wonder if it's possible to get down to 3 cycles. Scheduling the second
>   sub instruction in parallell with the first would be a start.
>
> It might be.  Good luck!  :-)

Don't hold your breath...

> The last few generations of x86 processors have free mov insns.

Sounds good. A move instruction only needs to update the
register-renaming state, it shouldn't issue to any execution unit.

> One thing which I never really explored is to evaluate at least a-b and
> a+b and choose the one which is divisible by 4.  If that cost just one
> extra critical insn, it should be faster.

Hmm. One should aim to have only a single added cmov on the critical
path, and do the add and divisibility test in parallel. 

One difficulty is that the shift count will depend on the choice (while
current code takes advantage of the count being the same for the other
two choices, a-b and b-a). Moving the ctz after the cmovs makes the
critical path longer. To avoid that, one would need to compute both
ctz(a-b) and ctz(a+b). Or even shift two candidate values, |a-b| and
(a+b) in parallel.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.