[PATCH] T3/T4 sparc shifts, plus more timings

Sun Mar 31 23:33:36 CEST 2013

David Miller <davem at davemloft.net> writes:

  For values of N >= 1 we would expect 1 cycle per iteration.  But
  that's not exactly what happens.

  N		cycles
  ======================
  1		2
  2		3
  3		4
  4		5
  5		6
  6		8
  7		11
  8		14
  9		17
  10		20

  Things look fine until we get to N=6, where the extra loop iteration
  seems to take 2 cycles instead of 1; from N=7 onward each additional
  iteration takes 3 cycles.

  I've tried aligning the first instruction of the loop at different
  offsets, and this doesn't make any difference.

  Instruction fetch starvation?  Unlikely: the number of instructions
  required in the group (2) is half of the fetch size, and we're hitting
  the I-cache every time since my test programs time the loop multiple
  times.

  Perhaps there is something wonky with the branch predictor on these
  chips.  Information is incredibly sparse in this area, so it's hard
  to say what might or might not be happening.

I think they messed up "predicted taken" and "predicted non-taken" at
the gate level.  So after enough iterations, the predictor
considers--correctly--that the branch will be taken.  And then the
hardware misinterprets it.

The loop branch back is fast only when it is predicted non-taken.

:-)
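
To see exactly where the behaviour changes, it helps to difference the
quoted table and look at the marginal cost of each extra iteration.  A
minimal sketch in Python; the names `cycles` and `marginal_costs` are
mine (not from David's test program), and the numbers are copied
straight from the table above:

```python
# Cycle counts copied from the quoted table: N iterations -> total cycles.
cycles = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 8, 7: 11, 8: 14, 9: 17, 10: 20}


def marginal_costs(table):
    """Return {N: table[N] - table[N-1]} for each N whose predecessor is present."""
    return {n: table[n] - table[n - 1] for n in sorted(table) if n - 1 in table}


deltas = marginal_costs(cycles)
for n, d in deltas.items():
    print(f"N={n:2d}: +{d} cycle(s) over N={n - 1}")
```

The extra iteration costs 1 cycle up to N=5, 2 cycles at N=6, and 3
cycles from N=7 on, which is at least consistent with a predictor that
needs a handful of taken branches to warm up and then makes things
slower rather than faster.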

-- 
Torbjörn