gmpbench -- how to utilize all CPU cores?

Fri Sep 27 16:53:13 CEST 2013

Fredrik Johansson <fredrik.johansson at gmail.com> writes:

  The parallel overhead certainly depends on things specific to GMP.. A
  function that does a lot of out-of-cache memory access might run fast
  on a single core while parallel instances slow down since the cores
  are competing for memory access. If the function is modified so that
  it works on chunks of data that fit in the cache of each core, it will
  run about as fast regardless of how many parallel instances there are.
  Taking this aspect into account when tuning GMP is clearly of interest
  for people who use their multicore systems to run several GMP-based
  computations in parallel.

I believe the cache friendliness of GMP to be quite good.

An exception is the FFT code, in particular when used for huge operands.
The coefficients of the transformed polynomial have size O(sqrt(n)) when
producing an n-bit product.  When n is large enough to make one sqrt(n)
large coefficient not fit in the cache, presumably things start to run
quite poorly.

We have new FFT code which uses limb sized coefficients.  Using small
coefficients, only the transform length can cause cache problem, but the
new FFT code uses the Cooley-Tukey decomposition trick which allows for
an arbitrary reduction of the transform size at the expense of some
extra twiddle factor multiplies.

The code can be found here: http://gmplib.org:8000/gcd-nisse/

-- 
Torbjörn