Discrepancy in measuring performance?
William E. Skeith III
wes_zilla at yahoo.com
Mon Apr 2 20:15:23 CEST 2012
I have a question regarding measuring the performance of GMP. The gist of it
is, I can't quite make my own test code (which calls GMP) perform the way I
expect, and was really hoping someone could help. Here are the details:
As far as I can tell, the configuration is selecting all the right assembly
routines for my processor (an intel core 2). I built src/tune/speed and ran
some tests. Here are the results for a test of multiplication (naive, long
multiplication, I think):
$ ./speed -s 16-32 -t 8 -c mpn_mul_basecase
overhead 6.13 cycles, precision 10000 units of 7.69e-10 secs, CPU freq 1300.00 MHz
Doing some math, we see that for 1024 bit integers, we are spending 1192.10 /
256 = 4.657 clock cycles for every multiplication instruction (we have to do
16**2 = 256 for long multiplication). Considering that the theoretical lower
bound for throughput of the multiplication instruction on a core 2 is 4 clock
cycles (last I heard), that's pretty good!
*Now the problem*: I wrote my own test code to make sure I could get this
performance in practice, and it does not perform so well. Perhaps there is
something fundamentally wrong with my method, but I cannot see it. Any help
would be greatly appreciated. Here are the details: I created two random
integers of the above sizes, and in a loop of 100000 iterations, multiplied
them together with GMP. I recorded the TSC before and after, and then divided
the difference by 100000. For 16 limbs (1024 bits) I get 2100 clock cycles,
compared to the 1192 above. For 32, I get 7000. This is not even close.
(Note: the threshold for fancy multiplication was around 23 limbs, so I guess
that is why it seems to be improving at 32; I told ./speed to stick with long
Does anyone know what on earth I'm doing wrong??? I have a hard time
believing that little errors in measurement aren't going to be drowned out in
100000 trials, which I've run well over 50 times today. I also measured time
with the system clock, and then computed the cycle count based on my clock
frequency, and obtained the exact same results.
Here are other things I've tried, but none were successful:
* My cpu has a constant TSC, so I set the frequency governor to always keep
the clock at the maximum (1.3GHz) so there wouldn't be any time wasted
getting it up to speed.
* To eliminate loop overhead (which should have been *tiny*) I called the
function repeatedly inside a single iteration.
* I statically linked to the GMP library.
* I messed around with different compiler / linker options, and tried to make
sure mine were similar to that of src/tune/Makefile.
* Instead of calling the high-level routines, I went straight to the "mpn_"
routines to act on vectors of limbs directly.
* I tried different random numbers (although the run time of multiplication is
probably not very data-dependent, powers of 2 aside).
* I ran the code in the terminal after a fresh boot. No wm, no X, no network,
very few daemons; system was pretty much silent except for this test code.
None of this made more than a tiny bit of difference, as one would expect. As
you can probably tell, I'm getting desperate. If you somehow have the
patience, I've linked to the (very short!) test code below. I am frustrated, out of
ideas, and probably doing something horrendously dumb. : ( Many thanks for
Sample output of above program:
cflags: -march=native -O2 -pipe
compiler: gcc -std=gnu99
gmp ver. 5.0.4
average TSCs: 7120.15
And here are the compiler options I used (some options are probably irrelevant,
but since src/tune/Makefile had them, I put them in too):
g++ -march=native -O2 -Wall -std=c++0x -Wl,--sort-common,--as-needed,-z,relro,--hash-style=gnu -static main.cpp -o main -lrt -lgmpxx -lgmp
More information about the gmp-discuss