div_qr_1n_pi1

Mon Jun 7 09:50:59 UTC 2021

nisse at lysator.liu.se (Niels Möller) writes:

  I've tried it out. Works nicely, but no speedup on my machine. I'm
  attaching another patch. There are then 4 methods:

  method 1: Old loop around udiv_qrnnd_preinv.

  method 2: The clever code from 10 years ago, with the microoptimization
    I commited the other day.

  method 3: More or less the same as I posted a few days ago.

  method 4: Postpones the update u1 -= u2 d, off the critical recurrency
    chain. Instead, conditionally adds in the constant B2 (B - d) to the
    lower u limbs.

Let me throw some additional complication into this project:

The fastest way to do a division by a single limb will involve more
precomputation, assuming we have good throughput of the multiply
instruction.

Even in some scenario where one has a 1-limb inverse precomputed, it
will be faster to first do a Newton round to get a 2-limb inverse, and
then run a loop which develops two quotient limbs at a time.  I expect
the dividend limb count cutoff point to be small, surely less than 10.

With full precomputation, the dividend limb count cutoff point would be
2, I presume.

The function div_qr_1n_pi1 is useful mainly for CPUs without a pipelined
multiply instruction.  And of course also useful until we have great
div_qr_1n_piN for N >= 2.

Improvements to div_qr_1n_piN are of course theoretically interesting
too.

-- 
Torbjörn
Please encrypt, key id 0xC8601622