gmpbench -- how to utilize all CPU cores?
tg at gmplib.org
Fri Sep 27 16:53:13 CEST 2013
Fredrik Johansson <fredrik.johansson at gmail.com> writes:
The parallel overhead certainly depends on things specific to GMP.. A
function that does a lot of out-of-cache memory access might run fast
on a single core while parallel instances slow down since the cores
are competing for memory access. If the function is modified so that
it works on chunks of data that fit in the cache of each core, it will
run about as fast regardless of how many parallel instances there are.
Taking this aspect into account when tuning GMP is clearly of interest
for people who use their multicore systems to run several GMP-based
computations in parallel.
I believe the cache friendliness of GMP to be quite good.
An exception is the FFT code, in particular when used for huge operands.
The coefficients of the transformed polynomial have size O(sqrt(n)) when
producing an n-bit product. When n is large enough to make one sqrt(n)
large coefficient not fit in the cache, presumably things start to run
We have new FFT code which uses limb sized coefficients. Using small
coefficients, only the transform length can cause cache problem, but the
new FFT code uses the Cooley-Tukey decomposition trick which allows for
an arbitrary reduction of the transform size at the expense of some
extra twiddle factor multiplies.
The code can be found here: http://gmplib.org:8000/gcd-nisse/
More information about the gmp-discuss