tg at gmplib.org
Mon May 4 11:33:03 CEST 2009
Marc Glisse <marc.glisse at normalesup.org> writes:
ASSERT_ALWAYS(pl3 <= pl);
- mpn_mul_fft (op, pl3, n, nl, m, ml, k3); /* mu */
pad_op = __GMP_ALLOCATE_FUNC_LIMBS (pl2);
+#pragma omp parallel
+#pragma omp sections
+ {
+#pragma omp section
+ mpn_mul_fft (op, pl3, n, nl, m, ml, k3); /* mu */
+#pragma omp section
 mpn_mul_fft (pad_op, pl2, n, nl, m, ml, k2); /* lambda */
+ }
cc = mpn_sub_n (pad_op, pad_op, op, pl2); /* lambda - low(mu) */
/* 0 <= cc <= 1 */
l = pl3 - pl2; /* l = pl2 / 2 since pl3 = 3/2 * pl2 */
I am definitely not an OpenMP expert and I am not familiar with any
MPN code, so I hope I still wrote something correct (it passes the
testsuite, but the testsuite was not written with this kind of abuse
in mind).
If the testsuite passed, the likelihood it works is quite high.
You might want to use the (intentionally undocumented) GMP_CHECK_FFT and
GMP_CHECK_RANDOMIZE environment variables:
while true; do tests/mpz/t-mul || break; done
Notice how discreet the pragmas are. When they are ignored (normal
mode), the code is the same as the current one. If I now compile gmp
with -fopenmp, the runtime of a multiplication between integers of
10^6 limbs is reduced by 20 to 30% when I enable a second thread.
30% speedup is nice, but I'd expect close to 100% with another processor
core. One reason for this is that the two FFT calls you apparently
parallelise are of quite different sizes.
A better place to parallelise--at least for large enough operands--might
be mpn_fft_mul_modF_K. There is plenty of parallelism there, but it is
also more fine grained than where you put the pragmas.
There are other places (in fft, in toom, but also in higher levels
like mpq) where similar things are possible, but with just a dual-core
laptop it did not make sense to push too far. The FFT code is only
used for large numbers, but in other places it would make sense to
only allow parallelism above some threshold, like (untested):
#pragma omp parallel if(nl>5000)
I very much doubt you'd get any benefits from doing Karatsuba parallel.
The cost of destructive cache sharing will very likely outweigh any gains.
Ultimately, GMP might see some automatic parallelisation. But there is
still so much more low-hanging fruit that I prefer not to spend much
more time on this at present. But feel free to play with it; I'll
incorporate clean patches if people agree they might be beneficial.