mpn_divexact_1 comments

Wed Oct 16 15:16:07 CEST 2013

I agree with Niels (don't understand + not appropriate for
low-level...).

We should replace mpn_divexact_1 with code that:

(1) Uses Jebelean's trick with a Euclidean division working left-to-
    right and a simultaneous Hensel division working right-to-left.
    This is faster in the single-limb divisor case since both division
    algorithms (Euclidean and Hensel) are typically limited by multiply
    instruction latency.

(2) Uses a 2-limb modular inverse but only uses Hensel norm.

Optimal code without divisor invariance would probably look like this:

if (n < THRESHOLD_1)
  binv = 1/d mod B
  mpn_bdiv_1_pi1 (...)
else if (n < THRESHOLD_2)
  binv = 1/d mod B²
  mpn_bdiv_1_pi2 (...)
else
  binv = 1/d mod B
  inv  = (B²-1)/d
  mpn_div_1_pi1_bdiv_1_pi1 (...)

(This code assumes the divisor d is odd and the limb base B has no
common factor...)

For CPUs with pipelined multiply insns, I'd expect THRESHOLD_1 to be
small, perhaps 3 to 4.  THRESHOLD_2 will be much larger, perhaps around
15.  (Without pipelined multiply, mpn_bdiv_1_pi1 will always be faster.)

-- 
Torbjörn