div_qr_1n_pi1
Torbjörn Granlund
tg at gmplib.org
Mon Jun 7 09:50:59 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
I've tried it out. Works nicely, but no speedup on my machine. I'm
attaching another patch. There are then 4 methods:
method 1: Old loop around udiv_qrnnd_preinv.
method 2: The clever code from 10 years ago, with the microoptimization
I commited the other day.
method 3: More or less the same as I posted a few days ago.
method 4: Postpones the update u1 -= u2 d, off the critical recurrency
chain. Instead, conditionally adds in the constant B2 (B - d) to the
lower u limbs.
Let me throw some additional complication into this project:
The fastest way to do a division by a single limb will involve more
precomputation, assuming we have good throughput of the multiply
instruction.
Even in some scenario where one has a 1-limb inverse precomputed, it
will be faster to first do a Newton round to get a 2-limb inverse, and
then run a loop which develops two quotient limbs at a time. I expect
the dividend limb count cutoff point to be small, surely less than 10.
With full precomputation, the dividend limb count cutoff point would be
2, I presume.
The function div_qr_1n_pi1 is useful mainly for CPUs without a pipelined
multiply instruction. And of course also useful until we have great
div_qr_1n_piN for N >= 2.
Improvements to div_qr_1n_piN are of course theoretically interesting
too.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list