Threading (was: bipartite modular multiplication)

Thu Apr 30 13:27:50 CEST 2009

On Wed, 29 Apr 2009, Niels Möller wrote:

> When
> getting into the divide-and-conquer range, that seems harder (and
> I think threading at this level is not going to get into gmp).

Just out of curiosity, I looked at what happens when one tries some 
trivial threading inside gmp, for a regular multiplication (so completely 
unrelated to this discussion, actually). In mul_fft.c:

@@ -956,9 +967,17 @@
                  nl, ml, pl2, pl3, k2));

    ASSERT_ALWAYS(pl3 <= pl);
-  mpn_mul_fft (op, pl3, n, nl, m, ml, k3);     /* mu */
    pad_op = __GMP_ALLOCATE_FUNC_LIMBS (pl2);
+#pragma omp parallel
+{
+#pragma omp sections
+{
+#pragma omp section
+  mpn_mul_fft (op, pl3, n, nl, m, ml, k3);     /* mu */
+#pragma omp section
    mpn_mul_fft (pad_op, pl2, n, nl, m, ml, k2); /* lambda */
+}
+}
    cc = mpn_sub_n (pad_op, pad_op, op, pl2);    /* lambda - low(mu) */
    /* 0 <= cc <= 1 */
    l = pl3 - pl2; /* l = pl2 / 2 since pl3 = 3/2 * pl2 */

I am definitely not an openmp expert and I am not familiar with any MPN 
code, I hope I still wrote something correct (it passes the testsuite, but 
the testsuite was not written with this kind of abuse in mind).

Notice how discrete the pragmas are. When they are ignored (normal mode), 
the code is the same as the current one. If I now compile gmp with 
-fopenmp, the runtime of a multiplication between integers of 10^6 limbs 
is reduced by 20 to 30% when I enable a second thread.

There are other places (in fft, in toom, but also in higher levels like 
mpq) where similar things are possible, but with just a dual-core laptop 
it did not make sense to push too far. The FFT code is only used for large 
numbers, but in other places it would make sense to only allow parallelism 
above some threshold, like (untested):
#pragma omp parallel if(nl>5000)

Now this was just a quick experiment for myself, to check my understanding 
of openmp and gmp, I am in no way implying that the choice to concentrate 
gmp on a single-threaded implementation is wrong, as it indeed seems 
better to do several multiplications in parallel rather than parallelize 
the implementation of a single multiplication.

-- 
Marc Glisse