tg at gmplib.org
Tue Sep 3 10:28:43 UTC 2019
nisse at lysator.liu.se (Niels Möller) writes:
In that case, not so surprising that the div1 function loses. Do other
architectures also have decent performance for small-quotient division?
Do you think table lookup on high bits should beat 16 cycles? It needs
to give good enough accuracy (possibly with an adjustment step) to not
result in a large penalty in iteration count.
Even with div1 deleted, the code handles q == 1 specially, and only
divides when q >= 2.
I think 16 cycles should be ueasy to beat.
Even with Intel's slow lzcnt (3 cycles latency, 1/cycle throughput)
something like this should be faster;
count_leading_zeros (ncnt, n0)
count_leading_zeros (dcnt, d0)
ni = ... extract high k bits from n0 ...
di = ... extract high \ell bits from d0 ...
q = qtab[ni<<\ell+di];
... adjust ...
One may also keep track of the msb for n0 and d0 for the benefit of
machines with slow lzcnt.
An alternative might be something along the lines of
if (n0 >= (d0 << 3))
q = n0 / d0;
q = 0;
if (n0 > (d0 << 2))
q += 4
n0 -= d0 << 2
if (n0 > (d0 << 1))
q += 2
n0 -= d0 << 1
if (n0 > (d0 << 0))
q += 1
n0 -= d0
with masking to avoid any branches from the "else" clause and onwards.
Please encrypt, key id 0xC8601622
More information about the gmp-devel