tg at gmplib.org
Sun Dec 20 18:48:52 CET 2009
bodrato at mail.dm.unipi.it writes:
We used the arbitrary condition (un >= 8*vn), then I'd suggest to iterate
mpn_fft_mul (prodp, up, 7*vn, vp, vn);
I went for 3vn,vn now. I need to take care of the next task for the
release, so I'll leave this somewhat suboptimal for the moment.
We now get close to 2x speedup compared to the mul.c from a few days
ago, for unbalanced operands in the FFT range.
If we model the cost of FFT with M(a,b)=(a+b)*log(a+b)*log(log(a+b)), we
can compare the two costs M(14*vn,vn) and 2*M(7*vn,vn), and see that the
first is bigger than the second up to a quite big vn.
Above that, one should probably switch to some 2*M(10*vn,vn), and so on.
But, for the moment, I think we can accept a "suboptimal" algorithm for
operands that are both: very big, and very unbalanced.
Interesting. Let's return to this!
We could probably achieve additional speedup:
1. Avoid leaving a small U piece after the loop. Instead expand the
penultimate block (or all blocks).
2. Round the C in C*vn,vn to an optimal FFT point. This is not hard.
In the beginning of mpn_mul_fft of mul_fft.c we have inspiration:
goodness (mp_size_t pl, int k)
mp_size_t N, Nprime, nprime, M;
N = pl * GMP_NUMB_BITS;
M = N >> k;
maxLK = mpn_mul_fft_lcm (GMP_NUMB_BITS, k);
Nprime = (1 + (2 * M + k + 2) / maxLK) * maxLK;
nprime = Nprime / GMP_NUMB_BITS;
if (nprime >= MUL_FFT_MODF_THRESHOLD)
unsigned long K2;
K2 = 1L << mpn_fft_best_k (nprime, sqr);
if ((nprime & (K2 - 1)) == 0)
nprime = (nprime + K2 - 1) & -K2;
Nprime = nprime * GMP_LIMB_BITS;
return 200 * (N >> k) / Nprime;
This could be used in mpn_mulmod_bnm1 too, I suppose.
3. Do some simple surgery into mul_fft.c for saving the transform of the
operand V. This could save > 25% more.
We don't need to be hysterical about overhead in this context, since the
actual computation will vastly dominate any sensible strategy computation.
More information about the gmp-devel