Replacement mpn/generic/mul.c

Sun Dec 20 18:48:52 CET 2009

bodrato at mail.dm.unipi.it writes:

  We used the arbitrary condition (un >= 8*vn), then I'd suggest to iterate
  mpn_fft_mul (prodp, up, 7*vn, vp, vn);

I went for 3vn,vn now.  I need to take care of the next task for the
release, so I'll leave this somewhat suboptimal for the moment.

We now get close to 2x speedup compared to the mul.c from a few days
ago, for unbalanced operands in the FFT range.

  If we model the cost of FFT with M(a,b)=(a+b)*log(a+b)*log(log(a+b)), we
  can compare the two costs M(14*vn,vn) and 2*M(7*vn,vn), and see that the
  first is bigger than the second up to a quite big vn.
  Above that, one should probably switch to some 2*M(10*vn,vn), and so on.
  But, for the moment, I think we can accept a "suboptimal" algorithm for
  operands that are both: very big, and very unbalanced.

Interesting.  Let's return to this!

We could probably achieve additional speedup:

1. Avoid leaving a small U piece after the loop.  Instead expand the
   penultimate block (or all blocks).

2. Round the C in C*vn,vn to an optimal FFT point.  This is not hard.
   In the beginning of mpn_mul_fft of mul_fft.c we have inspiration:

   int
   goodness (mp_size_t pl, int k)
   {
     mp_size_t maxLK;
     mp_size_t N, Nprime, nprime, M;

     N = pl * GMP_NUMB_BITS;
     M = N >> k;

     maxLK = mpn_mul_fft_lcm (GMP_NUMB_BITS, k);

     Nprime = (1 + (2 * M + k + 2) / maxLK) * maxLK;
     nprime = Nprime / GMP_NUMB_BITS;
     if (nprime >= MUL_FFT_MODF_THRESHOLD)
       {
         unsigned long K2;
         for (;;)
           {
             K2 = 1L << mpn_fft_best_k (nprime, sqr);
             if ((nprime & (K2 - 1)) == 0)
               break;
             nprime = (nprime + K2 - 1) & -K2;
             Nprime = nprime * GMP_LIMB_BITS;
           }
       }
     return 200 * (N >> k) / Nprime;
   }

  This could be used in mpn_mulmod_bnm1 too, I suppose.

3. Do some simple surgery into mul_fft.c for saving the transform of the
   operand V.  This could save > 25% more.

We don't need to be hysterical about overhead in this context, since the
actual computation will vastly dominate any sensible strategy computation.

-- 
Torbjörn