K8,K10 performance issues
Torbjorn Granlund
tg at gmplib.org
Tue Feb 15 15:21:41 CET 2011
Issue 1:
A while ago, I tracked down why the GMP powmod performance isn't as good
as I think it could be on the AMD K8-K10 processors (Opteron, Athlon64,
Phenom).
The diagram at http://gmplib.org/~tege/k8-addmul_2.png shows part of the
problem. The diagram was generated using this command:
tune/speed -p10000000 -Pfoo -C -s2-200 mpn_addmul_2
This is not a pretty curve. There is a major penalty for sizes n = 2
(mod 4).
I don't know the reason for this behaviour. There seem perhaps
different raggedness under n = 50 and over n = 50, but I think that's
not entirely true.
(I think this is the explanation for the performance drop at n = 50: The
branch predictor handles loops for up to around 12 [differs slightly
between K8/K9 and K10] with perfect branch prediction both for the
repeated iterations [i.e., branch taken] and for the loop exit [i.e.,
branch not taken]. At n = 50 we start seeing the penalty for a branch
misprediction at loop exit, since the 4-way unrolled loop gives a trip
count of about floor(50/4) = 12.)
The raggedness ought to be fixed.
Issue 2:
Another issue is sqr_basecase. It starts with an inlined mpn_mul_1 or
mpn_mul_2 depending on of the size n is even or odd, then iterates an
inlined addmul_2 until its tripcounts would be stupidly small. The
"corner" is implemented using a few multiplies as a straight line
program.
It should never use mul_1. It should always start with mul_2, then
iterate addmul_2 and leave two possible types of small "corners" til the
end. Using mul_1 initially, for a trip count of n-1, is silly since
mul_1 is slower per multiply than mul_2.
This too ought to be fixed by rewriting sqr_basecase.
--
Torbjörn
More information about the gmp-devel
mailing list