gmpbench -- how to utilize all CPU cores?
sitiveni at web.de
Sat Sep 28 09:55:51 CEST 2013
I have to backup Fredrik here. I am working in a research group at the
High Performance Computing Centre in Stuttgart, Germany (HLRS). It /is/
intrinsic to an implementation whether it scales well on multi-core
systems, for the reasons Fredrik mentioned: out-of-cache memory
access, data reuse, cache prediction. Of course it is platform
dependent whether there is a shared cache, and how big the caches and
cache lines are. But some bad ideas are bad on any architecture.
Also take into account that "many-core" systems with tens and hundreds
of cores are coming up. A library that isn't /shown/ to perform well
on these systems will never be used by serious users. Therefore
"believing" is not enough (@Torbjorn ;) )
Moreover, it is important for end users to be able to speed up a
single calculation using parallelism.
So: yes, parallel performance is important. No, serial code does not
automatically scale if you run it side by side in a parallel
environment. Yes, one has to demonstrate parallel performance and
scalability (weak and strong!) with appropriate benchmarks.
On 27.09.2013 16:53, Torbjorn Granlund wrote:
> Fredrik Johansson <fredrik.johansson at gmail.com> writes:
> The parallel overhead certainly depends on things specific to GMP. A
> function that does a lot of out-of-cache memory access might run fast
> on a single core while parallel instances slow down since the cores
> are competing for memory access. If the function is modified so that
> it works on chunks of data that fit in the cache of each core, it will
> run about as fast regardless of how many parallel instances there are.
> Taking this aspect into account when tuning GMP is clearly of interest
> for people who use their multicore systems to run several GMP-based
> computations in parallel.
> I believe the cache friendliness of GMP to be quite good.
> An exception is the FFT code, in particular when used for huge operands.
> The coefficients of the transformed polynomial have size O(sqrt(n)) when
> producing an n-bit product. When n is large enough that a single
> sqrt(n)-bit coefficient no longer fits in the cache, things presumably
> start to run quite poorly.
> We have new FFT code which uses limb sized coefficients. Using small
> coefficients, only the transform length can cause cache problems, but the
> new FFT code uses the Cooley-Tukey decomposition trick which allows for
> an arbitrary reduction of the transform size at the expense of some
> extra twiddle factor multiplies.
> The code can be found here: http://gmplib.org:8000/gcd-nisse/
More information about the gmp-discuss mailing list