# Threading (was: bipartite modular multiplication)

Marc Glisse marc.glisse at normalesup.org
Thu Apr 30 13:27:50 CEST 2009

```
On Wed, 29 Apr 2009, Niels Möller wrote:

> When
> getting into the divide-and-conquer range, that seems harder (and
> I think threading at this level is not going to get into gmp).

Just out of curiosity, I looked at what happens when one tries some
trivial threading inside gmp, for a regular multiplication (so completely
unrelated to this discussion, actually). In mul_fft.c:

@@ -956,9 +967,17 @@
   		nl, ml, pl2, pl3, k2));

   ASSERT_ALWAYS(pl3 <= pl);
-  mpn_mul_fft (op, pl3, n, nl, m, ml, k3);     /* mu */
+#pragma omp parallel
+{
+#pragma omp sections
+{
+#pragma omp section
+  mpn_mul_fft (op, pl3, n, nl, m, ml, k3);     /* mu */
+#pragma omp section
   mpn_mul_fft (pad_op, pl2, n, nl, m, ml, k2); /* lambda */
+}
+}
   cc = mpn_sub_n (pad_op, pad_op, op, pl2);    /* lambda - low(mu) */
   /* 0 <= cc <= 1 */
   l = pl3 - pl2; /* l = pl2 / 2 since pl3 = 3/2 * pl2 */

I am definitely not an OpenMP expert and I am not familiar with the mpn
code, but I hope I still wrote something correct (it passes the test
suite, though the test suite was not written with this kind of abuse in
mind).

Notice how unobtrusive the pragmas are. When they are ignored (the
normal, non-OpenMP build), the code is identical to the current one. If I
now compile gmp with -fopenmp, the runtime of a multiplication of
10^6-limb integers is reduced by 20 to 30% when I enable a second thread.

There are other places (in the FFT and Toom code, but also at higher
levels like mpq) where similar things are possible, but with just a
dual-core laptop it did not make sense to push too far. The FFT code is
only used for large numbers; in other places it would make sense to allow
parallelism only above some threshold, e.g. (untested):
#pragma omp parallel if(nl>5000)

Now, this was just a quick experiment for myself, to check my
understanding of OpenMP and GMP. I am in no way implying that the choice
to keep gmp single-threaded is wrong; it indeed seems better to run
several multiplications in parallel than to parallelize the
implementation of a single multiplication.

--
Marc Glisse
```