Division with on-the-fly normalization

Sat Mar 19 17:43:07 CET 2011

nisse at lysator.liu.se (Niels Möller) writes:

  > I expect the DC functions to continue to require normalised operands, so
  > the basecase calls from there will be to the "sb-norm" variant.  Right?

  I have the opposite expectation... This clearly has a big impact on
  everything.

Performance can always convince me.  If performance is the same, code
simplicity can also convince me.  :-)

  > For unnormalised divisors, I think the topmost few divisor limbs make
  > sense to put in the structure, and so does the shift count.  For
  > normalised divisors, I cannot see the point of having a field which is
  > always 0.

  Do you see a point in using the same struct type for the normalized and
  unnormalized case?

Not sure.  We can make it either way, and change it later of we prefer.

  > BTW, which are the smallest divisors you want the new SB function(s) to
  > support?  In the normalised case, might be possible, but it might be
  > easier to set he limit to 3.

  There's some word missing there.

No, a digit actually.  :-)  "2 might be possible" was the intended phrase.

  My current unnormalized code calls submul_1 with dn - 3 divisor limbs.
  So it would make sense to require at least 4 limbs, and my current test
  code uses divisors of bitsize >= 3 GMP_LIMB_BITS.

It would be nice to fix that, but not at the expense of speed.

[about DC division]

  The best way to cut the inputs in roughly half is not entirely obvious,
  but I doubt that normalized input in general is better. I suspect that
  for example the case of d being is an even number of limbs plus a dozen
  leftover bits is just as good or better than the cases of d being
  normalized of either even or odd limb count.

With normalised divisor, one may take ceil(dn/2) for the left part.
With unnormalised, one ill need to make sure the left part has strictly
more limbs than the right part.  (I am not trying to imply that this
means we necessarily need to go with normalised operands, I merely mean
that DC might cause us some minor headaches.  Ensuring--as we thus
might--that there are strictly more bits in the left part will allow for
ensuring we need at most *one* quotient adjustment at the end.)

  > I took a quick look at mpn_invertappr and it seems to require strict
  > "bit normalisation" (not just "limb normalisation").

  Can we agree on the following start?

  1. Use a common gmp_pi1_t struct for normalized and unnormalized case,
     including inverse, shift count and possibly normalized top divisor
     limbs (the latter not touched in the normalized case).

Fine.

  2. Separate sb entrypoints for normalized and unnormalized cases. Names?
     mod_1.c uses *_norm and *_unnorm, but those functions are static.
     (Personally I'd prefer to have a common advertised entrypoint, and do
     the two loops as static functions, but this isn't terribly important
     now).

I'd lie to at some point allow direct user access to the SB loops, since
an extra call in the call chain down to the actual dividing, will be
costly here.  The same is not true for DC.  (We can decide this later.)

     If we go with two entrypoints, the normalized one could stay with the
     current interface with a single mp_limb_t argument for the inverse,
     rather than a gmp_pi1_t pointer.

     One might also arrange things as an mpn_sbpi1_div_qr_norm (new name
     for current mpn_sbpi1_div_qr), and a general mpn_sbpi1_div_qr which
     starts with

      if (UNLIKELY (dinv->shift == 0))
        return mpn_sbpi1_div_qr_norm (...);

That's an idea.  The disadvantage being that internal calls which know
the operands are unnormalised will sustain some overhead.  This would be
bad, I think.

  First caller to be converted would then be mpn_tdiv_qr for small cases.
  We might want to split DC_DIV_QR_THRESHOLD into two thresholds, for
  normalized and unnormalized case.

Sounds reasonable.

  After that, I'd like to try out making the dc functions allow
  unnormalized inputs (with separate sb entrypoints, those functions will
  need somewhat complex inlined logic to chose betwen dcpi1, sbpi1_norm
  and sbpi1_unnorm for the recursive calls, depending on size and
  normalization. But in case we want different thresholds for normalized
  and unnormalized, I guess that's almost unavoidable).

  As for performance, I expect the following from on-the-fly normalization:

    * Case of a quotient of just a couple of limbs, and a large divisor:
      Significant improvement.

Agreed.

    * Small balanced (dn == qn) division: Really don't know, both ways to
      do normalization is O(qn), but with different unknown constants.

I expect speedup *with assembly*.  In C, register allocation and whatnot
might hurt performance.

    * Both dn and qn a few dozen limbs or more: Very small changes, the
      best choice possibly depending on the relative sizes of qn and dn.
      For either method, the linear work for normalization should be a
      very small fraction of the total division work.

    * Huge sizes: A significant saving in memory footprint. There's also a
      lot of temporary storage needed for the inverse and fft temporaries,
      but I imagine extra copies of the inputs could be 10% or more.

  motivated by performance. I expect that it will be tricky if we want to
  avoid all small regressions.

Makes sense.

-- 
Torbjörn