Division with on-the-fly normalization
tg at gmplib.org
Fri Mar 18 14:16:43 CET 2011
nisse at lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund <tg at gmplib.org> writes:
> OK, so you're suggestng that one replicates the most significant part of
> the divisor here? I assume you want to keep it normalised, i.e., the
> msb of dh = 1...
Yes. I figured that since we need to read three limbs and shift them in
order to get the normalized two limbs needed to compute the 3/2 inverse,
we could just as well store those normalized divisor limbs since they
will be needed later on.
I agree--in the unnormalised case. (But as I expressed some days ago, I
am not convinced we should have 3, since 2 might result in faster code.)
> At the lowest mpn level, I think we should probably have one loop per
> function. So we should have different functions for plain SB division
> for the case where operands will be shifted on-the-fly and for the case
> where they do not.
Ok with me. When will the choice of function to be called be made? For
each recursive call from the dc functions to sb functions?
I expect the DC functions to continue to require normalised operands, so
the basecase calls from there will be to the "sb-norm" variant. Right?
I suspect it might give a (very small) saving to precompute the
normalized top limbs once, at the top of the divide-and-conquer tree. But
I haven't thought very carefully about that.
I don't understand which context you are thinking of here.
> I don't thing the latter case should be burdened
> down by a 5-word structure, since filling that out will take a few extra
> cycles, and a few cycles matter for SB division.
> Do you agree?
I don't know. My understanding up to now has been that code using the
pi1 family of division functions should be able call invert_pi1 followed
by, e.g., mpn_dcpi1_div_qr, without any additional checks for normalized
vs unnormalized divisor.
Which is not how it works today, for sure.
Then I think the gmp_pi1_t structure should at
least include the shift count (since count_leading_zeros is potentially
a bit expensive), and it seems natural to check that shift count at the
start of each sb function rather than burden the caller with that.
Do you think the common case will be normalized or unnormalized
divisors? I imagine unnormalized will be the common case, but that's
because I also think that we should remove most or all normalizing
shifts from callers.
I am not so sure. Since DC work with normalised divisors, all its calls
to SB will be with normalised divisors (the upper parts of the full
There might be other cases, such as in some situation with invariance,
where pre-normalisation will make sense.
But for user calls of "random" operands, unnormalised divisors will be
much less common.
If you prefer, it's no big deal remove the three normalized dh limbs (or
keep just two of them, the input to 3/2 inversion, if that makes more
sense) and recompute when needed. But I think you've said earlier that
you do want to keep the shift count there, so that it's really a struct
rather than a single limb?
For unnormalised divisors, I think the topmost few divisor limbs make
sense to put in the structure, and so does the shift count. For
normalised divisors, I cannot see the point of having a field which is
BTW, which are the smallest divisors you want the new SB function(s) to
support? In the normalised case, might be possible, but it might be
easier to set he limit to 3. In the unnormalised case, if you want bits
from the limbs in the precomp structure, 3 seems like a natural smallest
One could also arrange so that those additional three limbs are neither
written nor read in the normalized case (i.e., make them valid iff shift
> 0). Would that be ok with you? It shouldn't be much of a burden to
sometimes have three unused limbs somewhere on the stack.
That's perhaps an alternative. Or are there reasonable scenarios where
one would need to keep lots of such structures around?
> I am not sure I understand this question. In general, the lowest-level
> division functions today require normalisation, SB, DC, and MU.
For the dc-code, I'm fairly sure the only reason it currently requires
normalization is because it calls the sb functions, which requires the
divisor to be normalized.
I think this is wrong, as the functions are implemented today. We need
to form the DC quotient approximation upon >= half the divisor bits.
(Sure, it would be possible to cut up the operands differently, making
the upper half have 1 or 2 more limbs than the lower half.)
For the mu-code, I was under the impression that it (and the
computation of the inverse) only needed dp[dn-1] > 0, not dp[dn-1] >
B/2. But I may be wrong.
I took a quick look at mpn_invertappr and it seems to require strict
"bit normalisation" (not just "limb normalisation").
I think some kind of "lazy normalization" makes sense whenever the
quotient is small in comparison to the divisor. If I understand you
correctly, current mpn_div_qr normalizes only the upper qn + small limbs
of d and 2qn + small limbs of u up front, and you want to keep this
strategy as long as it's calling the mu (aka blockwise barrett)
If you mean mpn_div_q not mpn_div_qr, yes, it works as you suggest.
(There is no mpn_div_qr, but a mpn_tdiv_qr, where such tricks pay off
less, since one needs to compute the remainder anyway.)
Finally, for very large operands, memory footprint gets important, and
then it's also beneficial not to have to keep shifted versions of the
Division function design always becomes very tricky. Perhaps we should
focus on SB division for now, and in particular the lowest level loops?
1. We need a loop for unnormalised operands with certain precomputed
operands. This code will be used for very small operands, and we
should keep all overhead down. We should probably provide a user
accessible entry point to this loop, using an opaque precomputation
2. We need a similar loop for normalised operands, with certain
3. We need higher-level functions announced to the user (perhaps
mpn_div_q, mpn_div_qr) without any precomputation. These will call
the functions defined in (1) and (2), among other functions. Let's
worry about this, but perhaps not right now.
4. We need higher-level functions announced to the user, with
precomputation, including precomputation implied by the underlying
multiplication algorithm (e.g., an FFT transform, or a "Toom
transform"). Again, let's not worry about this.
5. We might need new DC and MU functions too...
More information about the gmp-devel