gmpbench -- how to utilize all CPU cores?

. dcMhOYBdpZkH at
Sat Sep 28 20:44:11 CEST 2013

I agree with you. Someone skilled should parallelize the code and, at the
same time, teach people through the open source code how to properly
parallelize code that uses GMP.

I guess I could parallelize some of the benchmarks with pthreads (I
haven't looked at the code yet), but I wouldn't know the proper way of
doing it (and what about OpenCL, ...?); I'm too much of a beginner. I've
done some projects using pthreads and some using GMP, but never combined.

Also, remember, the code wouldn't be purely for the benchmark; it would
help people parallelize their own GMP projects. Otherwise, why do we
have multi-core CPUs, and what about GPUs? :) The future is
parallelization, since we have reached the GHz limit, I guess; you know
this. Hm, there must be some 'GMP parallel programming' tutorials out
there, but I couldn't find anything in particular. If there aren't any,
(wow,) it's about time for a tutorial section on :)

I hope you don't have to rewrite some of the GMP code (GMP 6.0, anyone?)
to make the parallelization work :-D

On 28.09.2013 09:55, Steffen Brinkmann wrote:
> Hello everybody,
> I have to backup Fredrik here. I am working in a research group at the
> High Performance Computing Centre in Stuttgart, Germany (HLRS). It
> /is/ intrinsic to an implementation whether it scales well on
> multi-core systems, for the reasons that Fredrik mentioned: out-of-cache
> memory access, data reuse, cache prediction. Of course it is platform
> dependent whether there is a shared cache, how big the caches and cache
> lines are, etc. But some bad ideas are bad on any architecture.
> Also take into account that there are "many-core" systems coming up
> with tens and hundreds of cores. A library that isn't /shown/ to
> perform well on these systems will never be used by serious users.
> Therefore "believing" is not enough (@Torbjorn ;) )
> Moreover, it is also important for end users to speed up a single
> calculation using parallelism.
> So: yes, parallel performance is important. No, a serial code does not
> scale automatically if you run it side by side in a parallel
> environment. Yes, one has to show parallel performance and scalability
> (weak and strong!) with appropriate benchmarks.
> Cheers, Steffen
> On 27.09.2013 16:53, Torbjorn Granlund wrote:
>> Fredrik Johansson <fredrik.johansson at> writes:
>>    The parallel overhead certainly depends on things specific to GMP. A
>>    function that does a lot of out-of-cache memory access might run fast
>>    on a single core while parallel instances slow down, since the cores
>>    are competing for memory access. If the function is modified so that
>>    it works on chunks of data that fit in the cache of each core, it
>>    will run about as fast regardless of how many parallel instances
>>    there are.
>>    Taking this aspect into account when tuning GMP is clearly of
>>    interest for people who use their multicore systems to run several
>>    GMP-based computations in parallel.
>>    I believe the cache friendliness of GMP to be quite good.
>>    I believe the cache friendliness of GMP to be quite good.
>> An exception is the FFT code, in particular when used for huge operands.
>> The coefficients of the transformed polynomial have size O(sqrt(n)) when
>> producing an n-bit product.  When n is large enough to make one sqrt(n)
>> large coefficient not fit in the cache, presumably things start to run
>> quite poorly.
>> We have new FFT code which uses limb-sized coefficients.  With small
>> coefficients, only the transform length can cause cache problems, but
>> the new FFT code uses the Cooley-Tukey decomposition trick, which
>> allows for an arbitrary reduction of the transform size at the expense
>> of some extra twiddle-factor multiplies.
>> The code can be found here:
> _______________________________________________
> gmp-discuss mailing list
> gmp-discuss at

OpenPGP key:

More information about the gmp-discuss mailing list