[PATCH] T3/T4 sparc shifts, plus more timings
Torbjorn Granlund
tg at gmplib.org
Sun Mar 31 23:33:36 CEST 2013
David Miller <davem at davemloft.net> writes:
For values of N >= 1 we would expect 1 cycle per iteration. But
that's not exactly what happens.
 N    cycles
======================
 1       2
 2       3
 3       4
 4       5
 5       6
 6       8
 7      11
 8      14
 9      17
10      20
Things look fine until we get to N=6, where the extra loop iteration
seems to take 2 cycles instead of 1; from N=7 onward each additional
iteration takes 3 cycles.
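The per-iteration cost is easier to see as differences between
consecutive rows of the table. A minimal sketch, using only the
measured totals quoted above (the names measured and marginal_costs
are made up for this illustration):

```python
# Measured totals copied from the table above: loop count N -> total cycles.
measured = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 8, 7: 11, 8: 14, 9: 17, 10: 20}


def marginal_costs(table):
    """Return {N: cycles(N) - cycles(N-1)} for each N whose predecessor was measured."""
    return {n: table[n] - table[n - 1] for n in sorted(table) if n - 1 in table}


deltas = marginal_costs(measured)
for n, d in deltas.items():
    # Extra cycles paid for going from N-1 to N iterations.
    print(f"N={n:2d}: +{d} cycle(s) over N={n - 1}")
```

The differences stay at 1 through N=5, rise to 2 at N=6, and are 3
from N=7 on.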
I've tried aligning the first instruction of the loop at different
offsets, and this doesn't make any difference.
Instruction fetch starvation? Unlikely: the number of instructions
required in the group (2) is half the fetch size, and we hit the
I-cache every time, since my test programs time the loop multiple
times.
Perhaps there is something wonky with the branch predictor on these
chips. Information is incredibly sparse in this area, so it's hard
to say what might or might not be happening.
I think they messed up "predicted taken" and "predicted non-taken" at
the gate level. So after enough iterations, the predictor
considers--correctly--that the branch will be taken. And then they
misinterpret it.
The loop branch back is fast only when it is predicted non-taken.
:-)
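To make the numbers concrete, here is a toy cost model in that spirit.
It assumes one cycle of fixed overhead, a 1-cycle back branch for the
first five iterations, 2 cycles for the sixth, and 3 cycles for every
later one. Those thresholds are simply fitted to the measured table;
they are assumptions for illustration and say nothing about how the
T3/T4 predictor is actually built.

```python
# Measured totals from the table: loop count N -> total cycles.
measured = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 8, 7: 11, 8: 14, 9: 17, 10: 20}


def back_branch_cost(i):
    """Hypothetical cost in cycles of iteration i (1-based).

    The breakpoints at 5 and 6 are fitted to the table, not documented
    hardware behaviour.
    """
    if i <= 5:
        return 1  # branch still cheap ("predicted non-taken" in the joke above)
    if i == 6:
        return 2  # transition iteration
    return 3      # steady state once the predictor has settled


def predicted_cycles(n, overhead=1):
    """Total cycles for n iterations under the toy model (overhead is assumed)."""
    return overhead + sum(back_branch_cost(i) for i in range(1, n + 1))


# Collect every N where the model disagrees with the measurement.
mismatches = {n: (predicted_cycles(n), c)
              for n, c in measured.items() if predicted_cycles(n) != c}
print(mismatches)
```

An empty dictionary means the piecewise model reproduces all ten
measurements, which is all it is meant to show.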
--
Torbjörn