[PATCH] T3/T4 sparc shifts, plus more timings

David Miller davem at davemloft.net
Sun Mar 31 23:20:39 CEST 2013


From: David Miller <davem at davemloft.net>
Date: Sun, 31 Mar 2013 13:36:35 -0400 (EDT)

> From: Torbjorn Granlund <tg at gmplib.org>
> Date: Sun, 31 Mar 2013 05:03:10 +0200
> 
>> The lshiftc code runs at 3 c/l on US3, not the claimed 2.5 c/l.  I
>> suspect the US1 claim of 2 c/l is also invalid.
> 
> So I've solved the puzzle of why we get 3 c/l for lshiftc on US3.
> 
> If the loop executes one time, it runs in 5 cycles.
> 
> But if it executes more than one time, each iteration after the
> first executes in 6 cycles.
> 
> The reason is that the final cycle:

No, this isn't the reason.  Back to the drawing board.

If the chip is folding the first instruction of the loop into the
final instruction group of the previous iteration, then successive
iterations should rotate through different groupings of the loop.

The initial iteration of the loop would group as:

        not     %l3, %l3
        sllx    u0, cnt, %l2
        ldx     [n + u0_off], u0        C WAS: up - 16

        andn    %l3, %l4, r0
        srlx    u1, tcnt, %l5
        stx     r0, [n + r0_off]        C WAS: rp - 8

        subcc   n, (2 * 8), n
        not     %l2, %l2

        sllx    u1, cnt, %l3
        andn    %l2, %l5, r1
        ldx     [n + u1_off], u1        C WAS: up - 24

        srlx    u0, tcnt, %l4
        bge,pt  %xcc, L(top)
         stx    r1, [n + r1_off]        C WAS: rp - 16
        not     %l3, %l3

and iterations 2, 4, 6, 8, etc. would group as:

        sllx    u0, cnt, %l2
        ldx     [n + u0_off], u0        C WAS: up - 16
        andn    %l3, %l4, r0

        srlx    u1, tcnt, %l5
        stx     r0, [n + r0_off]        C WAS: rp - 8
        subcc   n, (2 * 8), n

        not     %l2, %l2
        sllx    u1, cnt, %l3

        andn    %l2, %l5, r1
        ldx     [n + u1_off], u1        C WAS: up - 24
        srlx    u0, tcnt, %l4
        bge,pt  %xcc, L(top)

         stx    r1, [n + r1_off]        C WAS: rp - 16
        not     %l3, %l3
        sllx    u0, cnt, %l2

and iterations 3, 5, 7, 9, etc. would group as:

        ldx     [n + u0_off], u0        C WAS: up - 16
        andn    %l3, %l4, r0
        srlx    u1, tcnt, %l5

        stx     r0, [n + r0_off]        C WAS: rp - 8
        subcc   n, (2 * 8), n
        not     %l2, %l2

        sllx    u1, cnt, %l3
        andn    %l2, %l5, r1
        ldx     [n + u1_off], u1        C WAS: up - 24

        srlx    u0, tcnt, %l4
        bge,pt  %xcc, L(top)
         stx    r1, [n + r1_off]        C WAS: rp - 16
        not     %l3, %l3

I've verified these groupings in a test harness that simply lays out
three iterations of the loop right after one another, and replaces the
conditional branches with unconditional branches forward to the
instruction after the branch's delay slot.
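Counting the dispatch groups off the three listings above gives a
quick check of what this grouping hypothesis would predict.  This is
only a sketch: the group sizes are my transcription of the listings
(assuming one group issues per cycle), not a derivation from the
chip's grouping rules.

```python
# Dispatch groups transcribed from the three groupings listed above;
# each inner value is the size of one group, and one group issues per
# cycle, so cycles per iteration == number of groups.
groups_iter1 = [3, 3, 2, 3, 4]     # final group absorbs next iter's "not"
groups_even  = [3, 3, 2, 3, 1, 3]  # iterations 2, 4, 6, ...
groups_odd   = [3, 3, 3, 4]        # iterations 3, 5, 7, ...

first  = len(groups_iter1)                            # 5 cycles
steady = (len(groups_even) + len(groups_odd)) / 2     # average thereafter

print(first)       # 5
print(steady)      # 5.0 cycles/iteration
print(steady / 2)  # 2.5 c/l (two limbs per iteration)
```

Under this reading the steady state would average 5 cycles per
iteration, i.e. the originally claimed 2.5 c/l rather than the
measured 3 c/l, which is consistent with the grouping alone not being
the whole story.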

However, there seems to be something strange about UltraSPARC-III and
loop branches.  As the simplest example, consider:

	mov       N, %i2
	subcc     %i2, 1, %i2

1:
	bne,pt	%xcc, 1b
	 subcc	%i2, 1, %i2


For values of N >= 1 we would expect 1 cycle per iteration.  But
that's not exactly what happens.

N		cycles
======================
1		2
2		3
3		4
4		5
5		6
6		8
7		11
8		14
9		17
10		20

Things look fine until we get to N=6, where the extra loop iteration
seems to take 2 cycles instead of 1, and from N=7 onward each
additional iteration costs 3 cycles.
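The per-iteration cost can be read off mechanically as the difference
between consecutive rows of the table (a small sketch over the
measured numbers above):

```python
# Measured cycle counts from the table above, indexed by N = 1..10.
cycles = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 8, 7: 11, 8: 14, 9: 17, 10: 20}

# Cost of each additional iteration: cycles[N] - cycles[N-1].
deltas = [cycles[n] - cycles[n - 1] for n in range(2, 11)]
print(deltas)  # [1, 1, 1, 1, 2, 3, 3, 3, 3]
```

That is: 1 cycle per extra iteration up to N=5, a 2-cycle step at N=6,
and a steady 3 cycles per iteration from N=7 on.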

I've tried aligning the first instruction of the loop at different
offsets, and this doesn't make any difference.

Instruction fetch starvation?  Unlikely: the two instructions needed
per group are only half of the fetch width, and we're hitting the
I-cache every time since my test programs time the loop multiple
times.
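The "time the loop multiple times and keep the best run" methodology
can be sketched portably.  This is only an illustration of the
harness structure, not of the original measurements, which read the
SPARC %tick register; `time.perf_counter_ns` stands in for it here,
and the nanosecond figures it produces are machine-dependent.

```python
import time

def best_time_per_iter(n, runs=16):
    """Time a countdown loop `runs` times and return the best (i.e.
    warm-cache) per-iteration cost in nanoseconds."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        i = n
        while i != 0:          # the countdown loop under test
            i -= 1
        t1 = time.perf_counter_ns()
        best = min(best, t1 - t0)
    return best / n

print(best_time_per_iter(100000))
```

Repeating the measurement and taking the minimum is what makes the
I-cache-miss explanation unlikely: only the first run can miss.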

Perhaps there is something wonky with the branch predictor on these
chips.  Information is incredibly sparse in this area, so it's hard
to say what might or might not be happening.

Ho hum, I'll keep plugging away at this.


More information about the gmp-devel mailing list