hgcd1/2

Tue Sep 3 10:28:43 UTC 2019

nisse at lysator.liu.se (Niels Möller) writes:

  In that case, not so surprising that the div1 function loses. Do other
  architectures also have decent performance for small-quotient division?

  Do you think table lookup on high bits should beat 16 cycles? It needs
  to give good enough accuracy (possibly with an adjustment step) to not
  result in a large penalty in iteration count.

  Even with div1 deleted, the code handles q == 1 specially, and only
  divides when q >= 2.

I think 16 cycles should be ueasy to beat.

Even with Intel's slow lzcnt (3 cycles latency, 1/cycle throughput)
something like this should be faster;

count_leading_zeros (ncnt, n0)
count_leading_zeros (dcnt, d0)
ni = ... extract high k bits from n0 ...
di = ... extract high \ell bits from d0 ...
q = qtab[ni<<\ell+di];
... adjust ...

One may also keep track of the msb for n0 and d0 for the benefit of
machines with slow lzcnt.

An alternative might be something along the lines of

if (n0 >= (d0 << 3))
  q = n0 / d0;
else {
  q = 0;
  if (n0 > (d0 << 2))
    q += 4
    n0 -= d0 << 2
  if (n0 > (d0 << 1))
    q += 2
    n0 -= d0 << 1
  if (n0 > (d0 << 0))
    q += 1
    n0 -= d0
  }

with masking to avoid any branches from the "else" clause and onwards.

-- 
Torbjörn
Please encrypt, key id 0xC8601622