gcd_11 without abs

Sat Nov 21 15:10:16 UTC 2020

Niels Möller <nisse at lysator.liu.se> writes:

  And then there will be some conditional operations somewhere. For
  shortest path, it's best if the code can be arranged to do those
  operations in parallel with count trailing zeros.

  Current ARM code (I'm looking at the v6t2 version) does that, with two
  conditional instructions (rsbcc, movcs) independent of the ctz. But then
  count trailing zeros is done with two depending instructions, rbit, clz,
  so the critical path is then 4 instructions. Unclear to me why it runs
  faster than 4 cycles on some processors, maybe the rbit+clz combination
  is executed in a single step on some of the execution units?

The cycle counts are per bit eliminated, it is not cycle counts.

I never bothered to try to measure cycle counts.  One would need to
measure with specially selected operands with known behaviour for the
algorothm, but that would be fragile.  I mean, slight algorithm
variations could reduce slightly less or slightly more bits per
iteration.

  Current x86_64 code (I'm looking at the core2 version) also looks pretty
  tight, also with all conditional moves independent of the count trailing
  zeros instruction. Dependency graph is more complex, but as I read it it
  depth would be 4, with each of these groups of instructions executed in
  parallel in one cycle:

          sub     u0, %rdx                C v - u

          bsf     %rdx, %rcx
          mov     u0, %rax
          sub     v0, u0                  C u - v    <----

          jnz     L(top)
  L(top): cmovc   %rdx, u0                C u = |u - v|
          cmovc   %rax, v0                C v = min(u,v)

          shr     R8(%rcx), u0
  L(odd): mov     v0, %rdx

  I wonder if it's possible to get down to 3 cycles. Scheduling the second
  sub instruction in parallell with the first would be a start.

It might be.  Good luck!  :-)

The last few generations of x86 processors have free mov insns.
Therefore, one can mimic 3-operand operations with a mov; op pair.  That
might help.

  Back to the no-abs version. That seems to require cmovs on the critical
  path *after* the shift, and that would tend to make the critical path
  longer.

The current code got a lot of attention a year ago.  It will not be easy
to improve it.

One thing which I never really explored is to evaluate at least a-b and
a+b and choose the one which is divisible by 4.  If that cost just one
extra critical insn, it should be faster.

-- 
Torbjörn
Please encrypt, key id 0xC8601622