Division with on-the-fly normalization

Niels Möller nisse at lysator.liu.se
Fri Mar 18 20:40:30 CET 2011

Torbjorn Granlund <tg at gmplib.org> writes:

> I expect the DC functions to continue to require normalised operands, so
> the basecase calls from there will be to the "sb-norm" variant.  Right?

I have the opposite expectation... This clearly has a big impact on the
interface design.

> For unnormalised divisors, I think the topmost few divisor limbs make
> sense to put in the structure, and so does the shift count.  For
> normalised divisors, I cannot see the point of having a field which is
> always 0.

Do you see a point in using the same struct type for the normalized and
unnormalized case?

> BTW, which are the smallest divisors you want the new SB function(s) to
> support?  In the normalised case, might be possible, but it might be
> easier to set the limit to 3.

There's some word missing there. Anyway, the current normalized code
requires dn > 2, calling submul_1 with dn - 2 divisor limbs. We have
divrem_2, which is a loop around udiv_qr_3by2.
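For reference, the core operation of divrem_2 can be modeled directly.
This is only the mathematical specification of the 3/2 step, not GMP's
branch-free, inverse-based udiv_qr_3by2; 32-bit limbs are used so the
three-limb numerator fits in 128-bit arithmetic (GCC/Clang __int128):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t mp_limb_t; /* 32-bit limbs so a 3-limb numerator fits __int128 */

/* Reference model of the 3/2 division step: divide <u2,u1,u0> by the
   normalized two-limb divisor <d1,d0>, returning the quotient limb and
   storing the two-limb remainder in <*r1,*r0>.  Requires <u2,u1> <
   <d1,d0>, so that the quotient fits in a single limb. */
static mp_limb_t
udiv_qr_3by2_ref (mp_limb_t *r1, mp_limb_t *r0,
                  mp_limb_t u2, mp_limb_t u1, mp_limb_t u0,
                  mp_limb_t d1, mp_limb_t d0)
{
  unsigned __int128 n = ((unsigned __int128) u2 << 64)
    | ((unsigned __int128) u1 << 32) | u0;
  unsigned __int128 d = ((unsigned __int128) d1 << 32) | d0;
  *r1 = (mp_limb_t) ((n % d) >> 32);
  *r0 = (mp_limb_t) (n % d);
  return (mp_limb_t) (n / d);
}
```

The real loop replaces the 128-bit division by a multiplication with a
precomputed inverse of <d1,d0>, which is exactly where the "pi1" struct
discussed below comes in.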

My current unnormalized code calls submul_1 with dn - 3 divisor limbs.
So it would make sense to require at least 4 limbs, and my current test
code uses divisors of bitsize >= 3 * GMP_LIMB_BITS.

But I think it would be good to add support for dn == 3 to the
unnormalized version, in order to (i) have compatible interface with the
normalized version, and (ii) to not leave an unsupported gap down to
divrem_2/div_qr_2. BTW, we should also have a div_qr_2 which does
on-the-fly normalization.
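To illustrate what "on-the-fly" means here, a sketch (not GMP's actual
code, with count_leading_zeros open-coded) of computing the shift count
and the normalized top two divisor limbs without shifting the whole
operand; note it reads d[dn - 3] whenever the divisor is unnormalized,
which is why the minimum dn matters:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;
#define GMP_LIMB_BITS 64

/* Illustrative sketch: compute the shift count and the two normalized
   top limbs of d, on which a "pi1" inverse would be based.  Assumes
   d[dn - 1] != 0, dn >= 2, and dn >= 3 when d is unnormalized. */
static void
top_limbs_norm (const mp_limb_t *d, long dn,
                unsigned *shift, mp_limb_t *d1, mp_limb_t *d0)
{
  unsigned cnt = 0;
  mp_limb_t high = d[dn - 1];

  /* count_leading_zeros, open-coded */
  while ((high >> (GMP_LIMB_BITS - 1)) == 0)
    {
      high <<= 1;
      cnt++;
    }
  *shift = cnt;

  if (cnt == 0)
    {
      *d1 = d[dn - 1];
      *d0 = d[dn - 2];
    }
  else
    {
      *d1 = (d[dn - 1] << cnt) | (d[dn - 2] >> (GMP_LIMB_BITS - cnt));
      *d0 = (d[dn - 2] << cnt) | (d[dn - 3] >> (GMP_LIMB_BITS - cnt));
    }
}
```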

>   One could also arrange so that those additional three limbs are neither
>   written nor read in the normalized case (i.e., make them valid iff
>   shift > 0).
>
> That's perhaps an alternative.  Or are there reasonable scenarios where
> one would need to keep lots of such structures around?

I have difficulty imagining such a scenario.

> I think this is wrong, as the functions are implemented today.  We need
> to form the DC quotient approximation upon >= half the divisor bits.
> (Sure, it would be possible to cut up the operands differently, making
> the upper half have 1 or 2 more limbs than the lower half.)

The best way to cut the inputs in roughly half is not entirely obvious,
but I doubt that normalized input is better in general. I suspect that,
for example, the case of d being an even number of limbs plus a dozen
leftover bits is just as good as or better than the cases of d being
normalized with either an even or odd limb count.

> I took a quick look at mpn_invertappr and it seems to require strict
> "bit normalisation" (not just "limb normalisation").

And I think the main reason for that is that mpn_bc_invertappr (for
larger inputs, this function is used to compute the initial value for
the Newton iteration) calls sb or dc division, and these functions
currently require normalized inputs.

> Division function design always becomes very tricky.  Perhaps we should
> focus on SB division for now, and in particular the lowest level loops?

Sure, let's start there. When done, I think we will find that the
normalization requirements for the higher division routines are
inherited from the sb routines, and straightforward to eliminate.

Can we agree on the following start?

1. Use a common gmp_pi1_t struct for the normalized and unnormalized
   cases, including inverse, shift count, and possibly the normalized
   top divisor limbs (the latter not touched in the normalized case).

2. Separate sb entrypoints for normalized and unnormalized cases. Names?
   mod_1.c uses *_norm and *_unnorm, but those functions are static.
   (Personally I'd prefer to have a common advertised entrypoint, and do
   the two loops as static functions, but this isn't terribly important.)

   If we go with two entrypoints, the normalized one could stay with the
   current interface with a single mp_limb_t argument for the inverse,
   rather than a gmp_pi1_t pointer.

   One might also arrange things as an mpn_sbpi1_div_qr_norm (new name
   for current mpn_sbpi1_div_qr), and a general mpn_sbpi1_div_qr which
   starts with

    if (UNLIKELY (dinv->shift == 0))
      return mpn_sbpi1_div_qr_norm (...);

First caller to be converted would then be mpn_tdiv_qr for small cases.
We might want to split DC_DIV_QR_THRESHOLD into two thresholds, for the
normalized and unnormalized cases.
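Concretely, the struct of point 1 and the dispatching entrypoint could
look roughly like this. The extra field names, the _unnorm name, and
the stubbed-out loops are all hypothetical, just to show the shape of
the dispatch (GMP's real gmp_pi1_t carries only the inverse):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;
typedef long mp_size_t;
#define UNLIKELY(x) __builtin_expect ((x) != 0, 0) /* GCC-style hint */

/* Hypothetical common struct for point 1 above. */
typedef struct
{
  mp_limb_t inv;    /* 3/2 inverse based on the normalized top limbs */
  int shift;        /* normalization shift count; 0 iff d is normalized */
  mp_limb_t d1, d0; /* normalized top divisor limbs; valid iff shift > 0 */
} gmp_pi1_t;

/* Stubs standing in for the two loops of point 2; the real functions
   would do the actual schoolbook division.  Each returns a marker so
   the dispatch can be observed. */
static mp_limb_t
mpn_sbpi1_div_qr_norm (mp_limb_t *qp, mp_limb_t *np, mp_size_t nn,
                       const mp_limb_t *dp, mp_size_t dn, mp_limb_t inv)
{
  (void) qp; (void) np; (void) nn; (void) dp; (void) dn; (void) inv;
  return 0; /* marker: took the normalized path */
}

static mp_limb_t
mpn_sbpi1_div_qr_unnorm (mp_limb_t *qp, mp_limb_t *np, mp_size_t nn,
                         const mp_limb_t *dp, mp_size_t dn,
                         const gmp_pi1_t *dinv)
{
  (void) qp; (void) np; (void) nn; (void) dp; (void) dn; (void) dinv;
  return 1; /* marker: took the unnormalized path */
}

/* The general entrypoint sketched above. */
static mp_limb_t
mpn_sbpi1_div_qr (mp_limb_t *qp, mp_limb_t *np, mp_size_t nn,
                  const mp_limb_t *dp, mp_size_t dn, const gmp_pi1_t *dinv)
{
  if (UNLIKELY (dinv->shift == 0))
    return mpn_sbpi1_div_qr_norm (qp, np, nn, dp, dn, dinv->inv);
  return mpn_sbpi1_div_qr_unnorm (qp, np, nn, dp, dn, dinv);
}
```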

After that, I'd like to try out making the dc functions allow
unnormalized inputs. (With separate sb entrypoints, those functions will
need somewhat complex inlined logic to choose between dcpi1, sbpi1_norm
and sbpi1_unnorm for the recursive calls, depending on size and
normalization. But if we want different thresholds for the normalized
and unnormalized cases, I guess that's almost unavoidable.)

As for performance, I expect the following from on-the-fly normalization:

  * Case of a quotient of just a couple of limbs, and a large divisor:
    Significant improvement.

  * Small balanced (dn == qn) division: Really don't know; both ways to
    do normalization are O(qn), but with different unknown constants.

  * Both dn and qn a few dozen limbs or more: Very small changes, the
    best choice possibly depending on the relative sizes of qn and dn.
    For either method, the linear work for normalization should be a
    very small fraction of the total division work.

  * Huge sizes: A significant saving in memory footprint. There's also a
    lot of temporary storage needed for the inverse and FFT temporaries,
    but I imagine extra copies of the inputs could be 10% or more.

So my desire to get rid of normalization requirements is only partly
motivated by performance. I expect that it will be tricky if we want to
avoid all small regressions.


Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
