Division interfaces
Torbjorn Granlund
tg at gmplib.org
Tue Mar 15 14:09:34 CET 2011
nisse at lysator.liu.se (Niels Möller) writes:
nisse at lysator.liu.se (Niels Möller) writes:
> And then we have the issue of unnormalized divisors (it looks like
> mpn_sbpi1_div_qr still requires that dp[dn-1] has the high bit set.
I'm attaching a first crude version with on-the-fly normalization of the
high limbs.
If one reads the top 3 limbs of n and top 2 of d and shift left to
normalize d (shifting in zeros), and use 3/2 division, one will in
effect do a (2 limbs + k bits)/(1 limb + k bits) division, where 0 < k =
(GMP_LIMB_BITS - shift_count). Then the probability of the "unlikely"
update in schoolbook division is on the order of 2^{-k}, which for small
k (large shift count) actually isn't very unlikely.
For this reason, the code uses 4 limbs of n and 3 limbs of d (and a 4/3
division built from 3/2 division and an extra update step). This is
overkill when the shift count is small, so maybe one should have three
different variants,
1. Normalized d.
2. Shift count < (GMP_LIMB_BITS - 10). Use 3/2 division, the "unlikely"
update happens with probability < 2^{-10}.
3. Larger shift counts. Use 4/3 division (like the attached code), then
the unlikely update probability is < 2^{-GMP_LIMB_BITS}.
There is actually some need for:
4. Side-channel silent SB division. The O(log(D)) corrections must be
performed always, with their effect nullified when not needed
(using mpn_subcnd_n). Here, we should not worry about making a
quotient limb approximation correct with particularly high
probability.
I looked briefly at your code. Wouldn't it make sense to stay with 3/2
division, using the 2 x GMP_NUMB_BITS most significant bits from the
divisor, and the 3 x GMP_NUMB_BITS bits from the partial remainder?
Thus, feed udiv_qr_3by2 with *exactly* the sequence of parameters it
would get if we instead had pre-shifted the divisor and dividend.
The 3 x GMP_NUMB_BITS partial remainder bits would need to be formed by
reading the 4 most significant partial remainder limbs from memory. But
since we during quotient approximation (i.e., in udiv_qr_3by2) compute a
part of the next partial remainder, we can shorten the submul_1 trip
count, and not only save submul_1 execution time, but also simplify
formation of the next 3 partial remainder limbs that are to be fed to
udiv_qr_3by2.
Am I missing some reason for udiv_qr_4by3?
(I'd say to skip the 2^{-10} variant for now, and concentrate on making
the generic version as fast as possible.)
--
Torbjörn
More information about the gmp-devel
mailing list