gcd_22

Mon Aug 19 22:32:35 UTC 2019

We now have the algorithm

  while (highlimb(U) | highlimb(V)) != 0
    S = U - V
    T = V - U
    cnt = count_trailing_zeros(lowlimb(S))
    cnd1 = U < V
    if (cnd1)
      V = U
      U = T
    else
      U = S
    U >>= cnt

with some variation.  The "if" statement is performed with masking
tricks or conditional move/conditional select.

The main performance problem with that code is the dependency on
lowlimb(U) in the conditional assignment, through the double-limb shift,
to the S and T assignments.  The looping criteria is a smaller problem,
as branch prediction takes care of that.

I think we might get more speed from this algorithm:

  while (highlimb(U) | highlimb(V)) != 0
    S = U - V
    T = V - U
    cnt = count_trailing_zeros(lowlimb(S))
    cnd1 = U < V
    cnd2 = highlimb(U) < highlimb(V)
    if (cnd2)
      V = U
      U = T
    else
      U = S
    if (cnd1 != cnd2) { fix mistake! }
    U >>= cnt;

Why would it be faster?  Because while cnd1 needs to wait for the
previous iteration's double-limb shifting, cnd2 depends on just the
single-limb shifting of the high U world.

We might make the wrong decision iff highlimb(U) = highlimb(V) in
which case cnd1 might be true (cnd2 is false of course).

But when highlimb(U) = highlimb(V) then either highlimb(U-V) = 0 or
highlimb(V-U) = 0, so we're about to go into a gcd_21 situation.  (We
might also have highlimb(U-V) = highlimb(V-U) = 0, in which case we're
done completely.)

With some inspiration of Richard Sites, Alpha architect, I say: It's
the critical path, stupid!  :-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622