Parallelizing GMP

Mon Oct 31 06:36:50 CET 2011

Hello,

On Sat, 29 Oct 2011 21:52:00 +0200 (CEST)
bodrato at mail.dm.unipi.it wrote:

> >> repeated squaring part (if you do, please let me know), I am planning
> >> to parallelize in the library level by OpenMP.
> >
> > do you only compute the exponentiation of a single number? When you
> > parallelize, the higher the level, the greater the gains, and computing
> 
> The higher level you can try to parallelize, in the library, is
> mpn/generic/sqrmod_bnm1.c, it "Compute xm = a^2 mod (B^n - 1), xp = a^2
> mod (B^n + 1)", and the two computation can go parallel. They are already
> enclosed in two {}-blocks. Unfortunately inserting a few pragma is not
> enough, I fear, because the two computation probably share some scratch
> memory area.
> But if you do not care of the increased memory footprint, this can be a
> good place to start.

I inserted #pragma omp sections lines, allocated scratch memory for
each thread, and set OMP_NESTED=TRUE, OMP_DYNAMIC=TRUE and
OMP_THREAD_LIMIT=4, so that threads will be created recursively up to
the number of cores.

The running time was as long as that of the paralellized FFT version.
It seems it didn't use up the cores: although my machine has four
cores, the CPU time was twice as long as the wall time.  Switching off
nested parallelism slightly reduced the wall time.

In short, parallelizing sqrmod_bnm1 didn't help much.  Am I using
OpenMP in a wrong way?  I suspect nested parallel regions are doing
bad things, and that if so, parallelizing Toom (which I haven't tried
yet) will be difficult, because Toom might possibly be called from
parallelized FFT routine.

Regards,
-- 
Kentaro Yamamoto <call at taku2ro.dyndns.tv>

 LocalWords:  parallelizing