Division / remaindering of small numbers using a preinverse

Sun Sep 10 12:15:17 CEST 2006

I've been trying to optimize the chinese remaindering routines of my
small-prime fft code. In this context, I need to compute x mod m for
pretty small numbers. The cases I have needed so far are:

  xn  mn
   4   3   For (naive?) 3-prime crt
   4   2   For 4-prime crt (first using two 2-prime crts)

As far as I can see, montgomery reduction is not of any use here, so I
need a proper division. From which I need only the remainder, not the
quotient. But the m:s are constants, so it makes a lot of sense to ise
a preinverse. I've come up with the appended code, using the inverse

  di = invert_limb(m+1)
     = floor( (2^{2 GMP_LIMB_BITS} - 1) / (mh + 1) - 2^GMP_LIMB_BITS

under the assumptions (which hold in my case) that m is normalized,
i.e., most significant bit set, and that mh + 1 fits in a single limb:

  2^{GMP_LIMB_BITS - 1} <= mh < 2^{GMP_LIMB_BITS - 1}

The idea is that I want to round the divisor m *upwards*, to ensure
that I can never get a too large quotient. For the discussion here,
consider the basic case that mn = xn + 1. 

I wrote two variants of this code:

* One that uses the inverse directly to compute an approximate
  quotient

    q = floor (xh 2^{GMP_LIMB_BITS} / (mh + 1))
      = floor (xh * di * 2^{- GMP_LIMB_BITS}) + xh

  which may be up to two units smaller than the true quotient.

* One which uses udiv_qrnnd_preinv to divide the *two* top limbs of x
  by mh + 1. The quotient may still be two units too small (I think),
  but it is less likely, and with a tighter bound for the remainder x
  - m q than in the previous case.

Both are significantly faster than using mpn_tdiv. Running time on my
home PC (for some fixed random inputs; I haven't averaged the the over
random inputs):

xn-mn xn     tdiv      rem1      rem2
   1   2 0.000128  0.800373  0.770522#
   1   3 0.000181  0.568421  0.518421#
   1   4 0.000207  0.520690  0.460920#
   1   5 0.000305  0.490625  0.464062#
   1   6 0.000316  0.406344# 0.480363
   1   7 0.000330  0.471098  0.417630#
   2   2 0.000159  0.835329# 0.856287
   2   3 0.000283  0.500000# 0.537037
   2   4 0.000283  0.501684# 0.538721
   2   5 0.000323  0.610619  0.486726#
   2   6 0.000317  0.683735  0.594880#
   2   7 0.000458  0.645833  0.444792#
   3   2 0.000192  1.024814  0.883375#
   3   3 0.000291  0.590164# 0.632787
   3   4 0.000419  0.461276# 0.511390
   3   5 0.000488  0.636719  0.568359#
   3   6 0.000509  0.784644  0.578652#
   3   7 0.000443  0.709052  0.633621#
   4   2 0.000222  0.935484  0.858065#
   4   3 0.000405  0.741176  0.689412#
   4   4 0.000378  0.825758# 0.911616
   4   5 0.000488  0.783203  0.648438#
   4   6 0.000515  0.811111  0.666667#
   4   7 0.000549  0.866319  0.729167#
   5   2 0.000257  1.081784  0.996283#
   5   3 0.000400  0.763723# 0.828162
   5   4 0.000500  0.652672# 0.763359
   5   5 0.000496  0.896154  0.790385#
   5   6 0.000589  0.980583  0.775081#
   5   7 0.000622  0.773006  0.740798#

The first two columns give the sizes of the quotient and the divisor.
The third column is the timing for gmp's mpn_tdiv. The last two
columns give the relative timing (i.e., absolute time divided by the
tdiv timing) for the two variants of rem_preinv.

So the preinv code is much faster for small sizes, in particular when
the quotient is just one or two limbs. Although for very small sizes,
it might make sense to inline the function, unroll the loop, replace
mpn_sub_n calls with sub_ddmmss, etc.

Regards,
/Niels

/* Computes the remainder n mod d, overwriting n. di is the inverse of
   of (dp[dn-1] + 1), i.e., the divisor rounded *upwards*. d is
   assumed to have the most significant bit set. The most significant
   limb of n is passed in as nh. */
void
rem_preinv1 (mp_ptr np, mp_size_t nn, mp_srcptr dp, mp_size_t dn, mp_limb_t di, mp_limb_t nh)
{
  mp_limb_t dh;
  int c;

  ASSERT (nn >= dn);
  ASSERT (dn > 1);

  dh = dp[dn-1];
  ASSERT (dh & GMP_NUMB_HIGHBIT);
  ASSERT (dh < GMP_NUMB_MAX);

  for (;;)
    {
      if (nh > dh)
	{
	  mp_limb_t cy;
	  nh -= dh;

	  cy = mpn_sub_n (np + nn - dn + 1, np + nn - dn + 1, dp, dn - 1);

	  ASSERT (cy <= nh);
	  nh -= cy;
	}

      if (nh > 0)
	{
	  mp_limb_t q, ql;
	  mp_limb_t cy;

	  /* The true inverse is 2^GMP_NUMB_BITS + di */
	  umul_ppmm (q, ql, nh, di);
	  q += nh;

	  /* There can be no overflow */
	  ASSERT (q >= nh);

	  cy = mpn_submul_1 (np + nn - dn, dp, dn, q);

	  ASSERT (cy <= nh);

	  nh -= cy;
	  ASSERT (nh <= 3);

	  while (nh)
	    nh -= mpn_submul_1 (np + nn - dn, dp, dn, nh);
	}

      if (nn == dn)
	break;

      nh = np[--nn];
    }

  MPN_CMP (c, np, dp, dn);
  if (c >= 0)
    ASSERT_NOCARRY (mpn_sub_n (np, np, dp, dn));
}

void
rem_preinv2 (mp_ptr np, mp_size_t nn, mp_srcptr dp, mp_size_t dn, mp_limb_t di, mp_limb_t nh)
{
  mp_limb_t dh;
  int c;

  ASSERT (nn >= dn);
  ASSERT (dn > 1);

  dh = dp[dn-1];
  ASSERT (dh & GMP_NUMB_HIGHBIT);
  ASSERT (dh < GMP_NUMB_MAX);

  for (;;)
    {
      if (nh > dh)
	{
	  mp_limb_t cy;
	  nh -= dh;

	  cy = mpn_sub_n (np + nn - dn + 1, np + nn - dn + 1, dp, dn - 1);

	  ASSERT (cy <= nh);
	  nh -= cy;
	}

      if (nh > 0)
	{
	  mp_limb_t q;
	  mp_limb_t cy;
	  mp_limb_t r;
	  udiv_qrnnd_preinv (q, r, nh, np[nn-1], dh + 1, di);

	  /* We divided by dh + 1, but we really want to divide by dh.
	     Since a = q (b+1) + r = q b + (q + r), we correct the
	     remainder by adding q. */
	  cy = mpn_submul_1 (np + nn - dn, dp, dn - 1, q);

	  /* The upper part of the remainder is given by r + q - cy */
	  q -= cy;
	  np[nn-1] = r + q;
	  nh = np[nn-1] < r;

	  ASSERT (nh <= 1);

	  while (nh)
	    nh -= mpn_sub_n (np + nn - dn, np + nn - dn, dp, dn);

	}
      if (nn == dn)
	break;

      nh = np[--nn];
    }
  MPN_CMP (c, np, dp, dn);
  if (c >= 0)
    ASSERT_NOCARRY (mpn_sub_n (np, np, dp, dn));
}