Parallelizing GMP

Sat Oct 29 10:03:41 CEST 2011

Hello,

On Sat, 29 Oct 2011, Kentaro Yamamoto wrote:

> I am using GMP for computing modular exponentiation by repeated
> squaring (since the power are always a power of 2 and the modulus has
> such a good property that a special way of computing remainder can be
> used, I'm doing it by myself instead of using mpz_powm().)  The modulus
> is bigger than MUL_FFT_THRESHOLD and the time needed by multiplication
> accounts for almost all of the running time of my application.
>
> I have an Intel Core 2 Quad processor, and would like my application to
> scale on it.  Since I don't know how to introduce parallelism in the
> repeated squaring part (if you do, please let me know), I am planning
> to parallelize in the library level by OpenMP.

do you only compute the exponentiation of a single number? When you 
parallelize, the higher the level, the greater the gains, and computing 
the exponentiation of several numbers in parallel sounds like it would 
work best.

> I don't know much about
> how the algorithms used are combined (it seems more complicated than the
> manual describes), but firstly I added `#pragma omp for' line at the
> beginning of the for loop in the base case of mpn_fft_mul_modF_K().
> The running time almost halved, but since there are four cores, I would
> like the application to run more faster.
>
> Do you know other good places to put OpenMP directives?

First, a gain of a factor 2 on 4 processors, when you parallelize at a low 
level, is a great gain!

Some algorithms parallelize better than others, and the one currently in 
gmp wasn't written for that purpose. If I understand correctly, the FFT 
multiplication first computes the transform of each operand (you could try 
doing those 2 in parallel), then multiplies the 2 transforms componentwise 
(what you already parallelized) and eventually retransforms to get the 
result. If you are close to the FFT threshold, you could also try 
parallelizing the toom multiplications. Note that making things parallel 
completely changes the thresholds and other tunings. I am not sure if 
running make tune and recompiling is enough to fix that (and running times 
with openmp are likely too random for the tuning to work).

See also this discussion (but beware that the code is slightly different):
http://groups.google.com/group/mpir-dev/browse_thread/thread/d00c8765cacf600a

-- 
Marc Glisse