gcd_11 asm

Fri Aug 23 08:03:10 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  I've had a look at the latest gcd_11 asm, and it's really neat,
  including naturally getting %rdx zero on return.

  One question: the bd2 and bd4 versions use

  L(top): rep;bsf %rdx, %rcx  C tzcnt!

  I've not seen this before, but a quick web search indicates that tzcnt
  is the same as bsf, except that it has a well defined result also when
  the input is zero. But in these loops, we should get to this instruction
  only for non-zero %rdx. So are there any other subtleties?

That, and that the flags are set quite differently.

But the code cares neiter about the flags nor aboubt what might or might
not happen for zero input.

The really weird thing is that tzcnt, encoded identically to rep;bsf is
much faster than a bare bsf on AMD platforms.  One would think (1) an
instruction B which works like instruction A except that B has some
additional corner case requirements would not be faster than A, and (2)
why didn't they just make instruction A faster and also defined it for
0?, and (3) why didn't they make bsf faster too, it should be trivial.

You might say "but thanks to rep;bsf we KNOW that it is well-defined for
0".  Good point.  Except that rep;bsf will run just like bsf on any
older x86 CPU out there, i.e. be undefined for 0.  Therefore safe use of
rep;bsf requires knowldge of whether this instruction is specifically
handled as tzcnt (which asks for e.g. running the cpuid instruction).
Conclusion: silently making bsf well-defined is really the same as
silently making rep;bsf well-defined (except that the latter has a byte
longer encoding).

The performance difference is huge.  E.g., AMD Ryzen can execute 6 times
more rep;bsf than plain bsf per cycle, each with 2/3 of the latency.

See also: https://gmplib.org/~tege/x86-timing.pdf

-- 
Torbjörn
Please encrypt, key id 0xC8601622