Division with on-the-fly normalization

Fri Mar 18 14:16:43 CET 2011

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > OK, so you're suggestng that one replicates the most significant part of
  > the divisor here?  I assume you want to keep it normalised, i.e., the
  > msb of dh[2] = 1...

  Yes. I figured that since we need to read three limbs and shift them in
  order to get the normalized two limbs needed to compute the 3/2 inverse,
  we could just as well store those normalized divisor limbs since they
  will be needed later on. 

I agree--in the unnormalised case.  (But as I expressed some days ago, I
am not convinced we should have 3, since 2 might result in faster code.)

  > At the lowest mpn level, I think we should probably have one loop per
  > function.  So we should have different functions for plain SB division
  > for the case where operands will be shifted on-the-fly and for the case
  > where they do not.

  Ok with me. When will the choice of function to be called be made? For
  each recursive call from the dc functions to sb functions?

I expect the DC functions to continue to require normalised operands, so
the basecase calls from there will be to the "sb-norm" variant.  Right?

  I suspect it might give a (very small) saving to precompute the
  normalized top limbs once, at the top of the divide-and-conquer tree. But
  I haven't thought very carefully about that.

I don't understand which context you are thinking of here.

  > I don't thing the latter case should be burdened
  > down by a 5-word structure, since filling that out will take a few extra
  > cycles, and a few cycles matter for SB division.
  >
  > Do you agree?

  I don't know. My understanding up to now has been that code using the
  pi1 family of division functions should be able call invert_pi1 followed
  by, e.g., mpn_dcpi1_div_qr, without any additional checks for normalized
  vs unnormalized divisor.

Which is not how it works today, for sure.

  Then I think the gmp_pi1_t structure should at
  least include the shift count (since count_leading_zeros is potentially
  a bit expensive), and it seems natural to check that shift count at the
  start of each sb function rather than burden the caller with that.

  Do you think the common case will be normalized or unnormalized
  divisors? I imagine unnormalized will be the common case, but that's
  because I also think that we should remove most or all normalizing
  shifts from callers.

I am not so sure.  Since DC work with normalised divisors, all its calls
to SB will be with normalised divisors (the upper parts of the full
divisors).

There might be other cases, such as in some situation with invariance,
where pre-normalisation will make sense.

But for user calls of "random" operands, unnormalised divisors will be
much less common.

  If you prefer, it's no big deal remove the three normalized dh limbs (or
  keep just two of them, the input to 3/2 inversion, if that makes more
  sense) and recompute when needed. But I think you've said earlier that
  you do want to keep the shift count there, so that it's really a struct
  rather than a single limb?

For unnormalised divisors, I think the topmost few divisor limbs make
sense to put in the structure, and so does the shift count.  For
normalised divisors, I cannot see the point of having a field which is
always 0.

BTW, which are the smallest divisors you want the new SB function(s) to
support?  In the normalised case, might be possible, but it might be
easier to set he limit to 3.  In the unnormalised case, if you want bits
from the limbs in the precomp structure, 3 seems like a natural smallest
size.

  One could also arrange so that those additional three limbs are neither
  written nor read in the normalized case (i.e., make them valid iff shift
  > 0). Would that be ok with you? It shouldn't be much of a burden to
  sometimes have three unused limbs somewhere on the stack.

That's perhaps an alternative.  Or are there reasonable scenarios where
one would need to keep lots of such structures around?

  > I am not sure I understand this question.  In general, the lowest-level
  > division functions today require normalisation, SB, DC, and MU.

  For the dc-code, I'm fairly sure the only reason it currently requires
  normalization is because it calls the sb functions, which requires the
  divisor to be normalized.

I think this is wrong, as the functions are implemented today.  We need
to form the DC quotient approximation upon >= half the divisor bits.
(Sure, it would be possible to cut up the operands differently, making
the upper half have 1 or 2 more limbs than the lower half.)

  For the mu-code, I was under the impression that it (and the
  computation of the inverse) only needed dp[dn-1] > 0, not dp[dn-1] >
  B/2. But I may be wrong.

I took a quick look at mpn_invertappr and it seems to require strict
"bit normalisation" (not just "limb normalisation").

  I think some kind of "lazy normalization" makes sense whenever the
  quotient is small in comparison to the divisor. If I understand you
  correctly, current mpn_div_qr normalizes only the upper qn + small limbs
  of d and 2qn + small limbs of u up front, and you want to keep this
  strategy as long as it's calling the mu (aka blockwise barrett)
  functions?

If you mean mpn_div_q not mpn_div_qr, yes, it works as you suggest.
(There is no mpn_div_qr, but a mpn_tdiv_qr, where such tricks pay off
less, since one needs to compute the remainder anyway.)

  Finally, for very large operands, memory footprint gets important, and
  then it's also beneficial not to have to keep shifted versions of the
  operands around.

Sure.

Division function design always becomes very tricky.  Perhaps we should
focus on SB division for now, and in particular the lowest level loops?

1. We need a loop for unnormalised operands with certain precomputed
   operands.  This code will be used for very small operands, and we
   should keep all overhead down.  We should probably provide a user
   accessible entry point to this loop, using an opaque precomputation
   structure.

2. We need a similar loop for normalised operands, with certain
   precomputed operands...

3. We need higher-level functions announced to the user (perhaps
   mpn_div_q, mpn_div_qr) without any precomputation.  These will call
   the functions defined in (1) and (2), among other functions.  Let's
   worry about this, but perhaps not right now.

4. We need higher-level functions announced to the user, with
   precomputation, including precomputation implied by the underlying
   multiplication algorithm (e.g., an FFT transform, or a "Toom
   transform").  Again, let's not worry about this.

5. We might need new DC and MU functions too...

-- 
Torbjörn