Division with on-the-fly normalization

Fri Mar 18 12:50:41 CET 2011

Torbjorn Granlund <tg at gmplib.org> writes:

> OK, so you're suggestng that one replicates the most significant part of
> the divisor here?  I assume you want to keep it normalised, i.e., the
> msb of dh[2] = 1...

Yes. I figured that since we need to read three limbs and shift them in
order to get the normalized two limbs needed to compute the 3/2 inverse,
we could just as well store those normalized divisor limbs since they
will be needed later on. 

> At the lowest mpn level, I think we should probably have one loop per
> function.  So we should have different functions for plain SB division
> for the case where operands will be shifted on-the-fly and for the case
> where they do not.

Ok with me. When will the choice of function to be called be made? For
each recursive call from the dc functions to sb functions?

I suspect it might give a (very small) saving to precompute the
normalized top limbs once, at the top of the divide-and-conquer tree. But
I haven't thought very carefully about that.

> I don't thing the latter case should be burdened
> down by a 5-word structure, since filling that out will take a few extra
> cycles, and a few cycles matter for SB division.
>
> Do you agree?

I don't know. My understanding up to now has been that code using the
pi1 family of division functions should be able call invert_pi1 followed
by, e.g., mpn_dcpi1_div_qr, without any additional checks for normalized
vs unnormalized divisor. Then I think the gmp_pi1_t structure should at
least include the shift count (since count_leading_zeros is potentially
a bit expensive), and it seems natural to check that shift count at the
start of each sb function rather than burden the caller with that.

Do you think the common case will be normalized or unnormalized
divisors? I imagine unnormalized will be the common case, but that's
because I also think that we should remove most or all normalizing
shifts from callers.

If you prefer, it's no big deal remove the three normalized dh limbs (or
keep just two of them, the input to 3/2 inversion, if that makes more
sense) and recompute when needed. But I think you've said earlier that
you do want to keep the shift count there, so that it's really a struct
rather than a single limb?

One could also arrange so that those additional three limbs are neither
written nor read in the normalized case (i.e., make them valid iff shift
> 0). Would that be ok with you? It shouldn't be much of a burden to
sometimes have three unused limbs somewhere on the stack.

> I am not sure I understand this question.  In general, the lowest-level
> division functions today require normalisation, SB, DC, and MU.

For the dc-code, I'm fairly sure the only reason it currently requires
normalization is because it calls the sb functions, which requires the
divisor to be normalized. So if we make the sb functions do
normalization on the fly, then I think we can easily get rid of the
normalization requirement also for the dc functions. (Ok, also the
current qh logic in mpn_dcpi1_div_qr seems to depend on normalization,
but I hope that's not fundamental to the algorithm).

For the mu-code, I was under the impression that it (and the
computation of the inverse) only needed dp[dn-1] > 0, not dp[dn-1] >
B/2. But I may be wrong.

> While I agree that we should add SB division that does not require
> normalisation, I am not so sure there is much point in rewriting the
> larger-operand division code to handle unnormalised divisors.  For these
> larger operands, the potential saved time would be a very small fraction
> of the total time.  For SB division on the other hand, there should be
> significant savings possible.

I think some kind of "lazy normalization" makes sense whenever the
quotient is small in comparison to the divisor. If I understand you
correctly, current mpn_div_qr normalizes only the upper qn + small limbs
of d and 2qn + small limbs of u up front, and you want to keep this
strategy as long as it's calling the mu (aka blockwise barrett)
functions?

Finally, for very large operands, memory footprint gets important, and
then it's also beneficial not to have to keep shifted versions of the
operands around.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.