New mpn_divexact code

Thu Aug 3 15:52:10 CEST 2006

Paul.Zimmermann at loria.fr (Paul Zimmermann) writes:

  >       mpn_mul (tp, dp, dn, qp, in);
  > 
  > Here, {qp,in} is much smaller than {dp,dn} and your FFT code cleverly
  > transforms {qp,in} only once.

  Is that already in GMP?

Yes, it is in gmp 4.2.  Last pieces went in 2005-11-27.

  > Further speedups should be possible if {ip,in} where pre-transformed,
  > and then used in the mpn_mullow_n calls.  Perhaps pre-transformation
  > would be possible also for the comparatively large {dp,dn}?

  Yes of course.

Just to make it clear, does you precomputation FFT library help
if I am multiplying, say, a 1e9 digit operand with several
different 1e8 digit operands?  Does it then help to pre-transform
the 1e9 digit operand?

It is not immediately obvious to me how that could help.

  Here is something I wrote for the book I am writing with Richard Brent.
  Basically it says that an overlap of 2 bits is enough. Comments are welcome.

Thanks, I will try to print it and read it.

  > My Newton inversion code could surely be improved without involving
  > Jebelean's method.  There are comments in the file (see
  > http://swox.com/gmp/devel/mpn/generic/invert_2exp.c) from you ("Paul
  > Zimmermann spoke"), and comments by me about using short
  > multiplication (mpn_sqrhigh_mn, mpn_mulmid_mn).

  Yes we have to try this. Maybe it will save nothing.

I yesternight implemented what you suggested back in 2003, or at
least I think it is what I have implemented, though there are
some apparent discrepances.

You said back in 2003:

(2) you don't need the lshift: the recurrence can be written
    r[k+1] = r[k] + r[k] * (1 - r[k] * u), where you know that
    r[k] * u = 1 mod 2^(n0*GMP_NUMB_BITS), so I suggest the following:

    (a) {xp, n+n0} <- {rp, n0} * {u, n}	   [low n0 limbs of xp are 0]
    (b) {tp, 2*(n-n0)} <- {rp, n-n0} * {xp+n0, n-n0}
    (c) {rp+n0, n-n0} <- {tp, n-n0}

    You could even do (b) directly at rp, except perhaps for the last step
    if you have not enough memory at rp.

    Step (a) could benefit from the "middle product algorithm", since you
    want only the (n-n0) middle limbs from the full product.

    In step (b), you need only the low (n-n0) limbs of the product, thus
    a low short product would be useful (and thus you could avoid the copy
    of (c) even for the last step). The file mul_lo.c from ecm-5.0 contains
    such a low product (see below, but I guess Kevin also wrote one).

This sounds like,

  for (...)
    {
      mpn_mul (xp, up, 2 * rn, rp, rn);
      mpn_mullow_n (rp + rn, rp, xp + rn, newrn - rn);
      MPN_COPY (rp + rn, rp + rn, newrn - rn);
    }

but I don't understand how that could work.  My new code looks
like this,

  for (...)
    {
      mpn_mul (xp, up, 2 * rn, rp, rn);
      mpn_mullow_n (rp + rn, rp, xp + rn, newrn - rn);
      mpn_neg_n (rp + rn, rp + rn, newrn - rn);
    }

i.e., the MPN_COPY is instead a negation.

(Note that mpn_neg_n is not in GMP 4.2.)

I use 2*rn for the up size in the mpn_mul call, to be nice to
FFT.  (2*rn will be equal to newrn or to newrn+1.)

I get a fair speedup with that algorithm compared to the old
algorithm, which looked like this:

  for (...)
    {
      mpn_sqr_n (iip, rp, rn);			/* mpn_sqrhigh_mn */
      mpn_mullow_n (iiup, iip, up, newrn);	/* mpn_mulmid_mn */
      MPN_ZERO (rp + rn, newrn - rn);		/* ugly */
      mpn_lshift (rp, rp, newrn, 1);
      mpn_sub_n (rp, rp, iiup, newrn);
    }

The speedup seems to be 10% to 20%, wriggling a little up and
down depending on operand size.

The new code saves memory, the old code needed 2n+eps and the new
needs 3n/2+eps limbs.

The old code isn't too clever, I don't understand why I feed lots
if newly created zero limbs to mpn_lshift.  :-(

As written, the new code cannot use the mul_lo code from GMP-ECM,
since that apparently needs extra space at the destination, not
just the n requested low limbs.  The present GMP 4.2 mullow_n is
probably slower than the GMP-ECM code, but it fits better here.
I'll look into this soon.

-- 
Torbjörn