tg at gmplib.org
Mon May 4 11:33:03 CEST 2009
Marc Glisse <marc.glisse at normalesup.org> writes:
ASSERT_ALWAYS(pl3 <= pl);
- mpn_mul_fft (op, pl3, n, nl, m, ml, k3); /* mu */
pad_op = __GMP_ALLOCATE_FUNC_LIMBS (pl2);
+#pragma omp parallel
+#pragma omp sections
+ {
+#pragma omp section
+ mpn_mul_fft (op, pl3, n, nl, m, ml, k3); /* mu */
+#pragma omp section
 mpn_mul_fft (pad_op, pl2, n, nl, m, ml, k2); /* lambda */
+ }
cc = mpn_sub_n (pad_op, pad_op, op, pl2); /* lambda - low(mu) */
/* 0 <= cc <= 1 */
l = pl3 - pl2; /* l = pl2 / 2 since pl3 = 3/2 * pl2 */
I am definitely not an OpenMP expert and I am not familiar with any
MPN code, so I hope I still wrote something correct (it passes the
testsuite, but the testsuite was not written with this kind of abuse
in mind).
If the testsuite passed, the likelihood it works is quite high.
You might want to use the (intentionally undocumented) GMP_CHECK_FFT and
GMP_CHECK_RANDOMIZE environment variables:
while true; do tests/mpz/t-mul || break; done
Notice how discreet the pragmas are. When they are ignored (normal
mode), the code is the same as the current one. If I now compile gmp
with -fopenmp, the runtime of a multiplication between integers of
10^6 limbs is reduced by 20 to 30% when I enable a second thread.
30% speedup is nice, but I'd expect close to 100% with another processor
core. One reason for this is that the two FFT calls you apparently
parallelise are of quite different sizes.
A better place to parallelise--at least for large enough operands--might
be mpn_fft_mul_modF_K. There is plenty of parallelism there, but it is
also more fine grained than where you put the pragmas.
There are other places (in fft, in toom, but also in higher levels
like mpq) where similar things are possible, but with just a dual-core
laptop it did not make sense to push too far. The FFT code is only
used for large numbers, but in other places it would make sense to
only allow parallelism above some threshold, like (untested):
#pragma omp parallel if(nl>5000)
I very much doubt you'd get any benefits from doing Karatsuba parallel.
The cost of destructive cache sharing will very likely outweigh any gains.
Ultimately, GMP might see some automatic parallelisation. But there is
still so much more low-hanging fruit that I prefer not to spend much
more time on this at present. But feel free to play with it; I'll
incorporate clean patches if people agree they might be beneficial.