K8,K10 performance issues

Tue Feb 15 15:21:41 CET 2011

Issue 1:

A while ago, I tracked down why the GMP powmod performance isn't as good
as I think it could be on the AMD K8-K10 processors (Opteron, Athlon64,
Phenom).

The diagram at http://gmplib.org/~tege/k8-addmul_2.png shows part of the
problem.  The diagram was generated using this command:

  tune/speed -p10000000 -Pfoo -C -s2-200 mpn_addmul_2

This is not a pretty curve.  There is a major penalty for sizes n = 2
(mod 4).

I don't know the reason for this behaviour.  There seem perhaps
different raggedness under n = 50 and over n = 50, but I think that's
not entirely true.

(I think this is the explanation for the performance drop at n = 50: The
branch predictor handles loops for up to around 12 [differs slightly
between K8/K9 and K10] with perfect branch prediction both for the
repeated iterations [i.e., branch taken] and for the loop exit [i.e.,
branch not taken].  At n = 50 we start seeing the penalty for a branch
misprediction at loop exit, since the 4-way unrolled loop gives a trip
count of about floor(50/4) = 12.)

The raggedness ought to be fixed.

Issue 2:

Another issue is sqr_basecase.  It starts with an inlined mpn_mul_1 or
mpn_mul_2 depending on of the size n is even or odd, then iterates an
inlined addmul_2 until its tripcounts would be stupidly small.  The
"corner" is implemented using a few multiplies as a straight line
program.

It should never use mul_1.  It should always start with mul_2, then
iterate addmul_2 and leave two possible types of small "corners" til the
end.  Using mul_1 initially, for a trip count of n-1, is silly since
mul_1 is slower per multiply than mul_2.

This too ought to be fixed by rewriting sqr_basecase.

-- 
Torbjörn