div_qr_1 interface

Wed Oct 23 01:32:46 CEST 2013

I added data for the new code at <http://gmplib.org/devel/asm.html>.

There is a line for div_qr_1u_pi1 as well, since that will also be
needed.  It might actually be more common that the divisor is not
normalised.

I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and
then add a threshold for the non-invariant case.  If my old data for
those is correct, the pi2 code is always faster for large enough
operands.

I expect div_qr_1u_pi1 to be no slower than div_qr_1n_pi1 on some
machines, just like divrem_1 is often the same speed for normalised and
unnormalised divisors (sometimes using one loop, sometimes using two).

To use just one loop, we probably need an efficient shrd, since then the
normalised case just means a shift count of 0.  (Only Intel's high-end
processors run shrd well.)
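
To illustrate the one-loop idea, here is a minimal sketch in portable C,
not the actual GMP code.  It assumes 32-bit limbs and uses a plain
64-bit division where the real loops would use a precomputed inverse.
The names funnel_shl and sketch_div_qr_1 are made up for this example.
The dividend is shifted on the fly, so a normalised divisor is simply
the case cnt == 0.

```c
#include <stddef.h>
#include <stdint.h>

/* High limb of (hi:lo) << cnt, for 0 <= cnt <= 31.  This plays the
   role of a hardware double-limb shift.  The two-step right shift
   keeps cnt == 0 well defined in C, where lo >> 32 would not be. */
static uint32_t
funnel_shl (uint32_t hi, uint32_t lo, unsigned cnt)
{
  return (hi << cnt) | ((lo >> 1) >> (31 - cnt));
}

/* Hypothetical single-loop division of {up, n} by d, with d != 0 and
   n >= 1.  Writes n quotient limbs to qp and returns the remainder.
   Works for normalised and unnormalised d alike. */
static uint32_t
sketch_div_qr_1 (uint32_t *qp, const uint32_t *up, size_t n, uint32_t d)
{
  unsigned cnt = 0;
  uint32_t dn, r;
  size_t i;

  /* Count leading zeros of d; cnt stays 0 if d is already normalised. */
  while (((d << cnt) & 0x80000000u) == 0)
    cnt++;
  dn = d << cnt;

  /* Bits shifted out of the top limb form the initial remainder. */
  r = funnel_shl (0, up[n - 1], cnt);

  for (i = n; i-- > 0;)
    {
      uint32_t lo = (i > 0) ? up[i - 1] : 0;
      uint32_t u = funnel_shl (up[i], lo, cnt);
      uint64_t num = ((uint64_t) r << 32) | u;

      /* r < dn holds on entry, so the quotient fits in one limb. */
      qp[i] = (uint32_t) (num / dn);
      r = (uint32_t) (num % dn);
    }

  /* Undo the normalisation shift on the remainder. */
  return r >> cnt;
}
```

Whether such a combined loop is competitive depends on how cheap the
double-limb shift is on the target machine, which is the point about
shrd above.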

I suppose we should provide just div_qr_1_pi1 when a general loop is
fast.

-- 
Torbjörn