[PATCH] T3/T4 sparc shifts, plus more timings

David Miller davem at davemloft.net
Thu Mar 28 01:32:44 CET 2013


From: Torbjorn Granlund <tg at gmplib.org>
Date: Wed, 27 Mar 2013 22:07:47 +0100

> David Miller <davem at davemloft.net> writes:
> 
>   As an aside I think we can get it down to 2.5 cycles per limb on
>   T4 with 4-way unrolling, and 3.0 cycles per limb with 2-way
>   unrolling.
>   
>   The idea is to decrease the bookkeeping instructions by only
>   maintaining base pointers which do not change, and then we have an
>   offset which operates as the loop index.
>   
>   So we'd instead have an 'n_off' instead of 'n', and then in some local
>   registers we'd hold:
>   
>   l3:	up - 8
>   l4:	up - 16
>   l5:	rp - 8
>   l6:	rp - 16
>   
> A clever trick!  But you will probably get 2.75 c/l for 4-way, not 2.5
> c/l.  We'll need infinite unrolling for 2.5...

Hmmm, not sure I understand.  But anyways, it turns out we don't need to
use the fused compare-and-branch, here is what the 2-way loop looks like:

1:
        sllx    u0, cnt, %l2
        or      %l4, %l3, r0

        ldx     [n + u_off0], u0
        srlx    u1, tcnt, %l5

        stx     r0, [n + r_off0]
        sllx    u1, cnt, %l3

        or      %l2, %l5, r1
        ldx     [n + u_off1], u1

        subcc   n, (2 * 8), n
        srlx    u0, tcnt, %l4

        bge,pt  %xcc, 1b
         stx    r1, [n + r_off0]

Which is 6 cycles per loop, 3 cycles per limb.

On UltraSparc-1 et al. we could shuffle things around and get it
to schedule like this:

1:
        sllx    u0, cnt, %l2
        or      %l4, %l3, r0
        ldx     [n + u_off0], u0

        srlx    u1, tcnt, %l5
        stx     r0, [n + r_off0]
        subcc   n, (2 * 8), n

        sllx    u1, cnt, %l3
        or      %l2, %l5, r1
        ldx     [n + u_off1], u1

        srlx    u0, tcnt, %l4
        bge,pt  %xcc, 1b
         stx    r1, [n + r_off0]

The usual rules about these older chips applies, one shift per group,
when there are multiple integer instructions in a group the shift
must appear first.  Loads have a 2 cycle latency, and the above loop
places the read dependency 3 cycles after the load.

Unless I've made a mistake that's 4 cycles per loop, 2 cycles per
limb.  That's what the current 4-way unrolling code obtains.  In fact
I think that a 4-way unrolling using the above bookkeeping
simplification wouldn't execute any faster, because we'll have a
troublesome odd instruction at the end of the loop.

Note that the {u,r}_off{0,1} values need to be adjusted depending upon
where exactly the "subcc n, (2 * 8), n" instruction is placed.


More information about the gmp-devel mailing list