Division interfaces

Tue Mar 15 14:09:34 CET 2011

nisse at lysator.liu.se (Niels Möller) writes:

  nisse at lysator.liu.se (Niels Möller) writes:

  > And then we have the issue of unnormalized divisors (it looks like
  > mpn_sbpi1_div_qr still requires that dp[dn-1] has the high bit set.

  I'm attaching a first crude version with on-the-fly normalization of the
  high limbs.

  If one reads the top 3 limbs of n and top 2 of d and shift left to
  normalize d (shifting in zeros), and use 3/2 division, one will in
  effect do a (2 limbs + k bits)/(1 limb + k bits) division, where 0 < k =
  (GMP_LIMB_BITS - shift_count). Then the probability of the "unlikely"
  update in schoolbook division is on the order of 2^{-k}, which for small
  k (large shift count) actually isn't very unlikely.

  For this reason, the code uses 4 limbs of n and 3 limbs of d (and a 4/3
  division built from 3/2 division and an extra update step). This is
  overkill when the shift count is small, so maybe one should have three
  different variants,

  1. Normalized d.

  2. Shift count < (GMP_LIMB_BITS - 10). Use 3/2 division, the "unlikely"
     update happens with probability < 2^{-10}.

  3. Larger shift counts. Use 4/3 division (like the attached code), then
     the unlikely update probability is < 2^{-GMP_LIMB_BITS}.

There is actually some need for:

  4. Side-channel silent SB division.  The O(log(D)) corrections must be
     performed always, with their effect nullified when not needed
     (using mpn_subcnd_n).  Here, we should not worry about making a
     quotient limb approximation correct with particularly high
     probability.

I looked briefly at your code.  Wouldn't it make sense to stay with 3/2
division, using the 2 x GMP_NUMB_BITS most significant bits from the
divisor, and the 3 x GMP_NUMB_BITS bits from the partial remainder?
Thus, feed udiv_qr_3by2 with *exactly* the sequence of parameters it
would get if we instead had pre-shifted the divisor and dividend.

The 3 x GMP_NUMB_BITS partial remainder bits would need to be formed by
reading the 4 most significant partial remainder limbs from memory.  But
since we during quotient approximation (i.e., in udiv_qr_3by2) compute a
part of the next partial remainder, we can shorten the submul_1 trip
count, and not only save submul_1 execution time, but also simplify
formation of the next 3 partial remainder limbs that are to be fed to
udiv_qr_3by2.

Am I missing some reason for udiv_qr_4by3?

(I'd say to skip the 2^{-10} variant for now, and concentrate on making
the generic version as fast as possible.)

-- 
Torbjörn