[PATCH] T3/T4 sparc shifts, plus more timings

Tue Mar 26 17:07:34 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 26 Mar 2013 14:28:34 +0100

> David Miller <davem at davemloft.net> writes:
> 
>   These give a modest speedup compared to the T1 routines.
>   I also added missing T3 timings to existing code.
>   
> The first thing to try then is finding code that runs well on both.
> There is a cost in having more variants than we need.

It runs well on both.  It's scheduled for T4, but for T3 what matters
is hitting the load latency and the pure number of operations.

>   Also, I worked on a copyi/copyd for T3/T4 that uses cache-initializing
>   stores (basically, if you're going to write a full aligned 64-byte
>   cache line, you tell the chip by using a special ASI in the stores,
>   and the cpu will simply clear the entire cache line on write to the
>   first word of the cache line, eliminating all the memory traffic).
> 
> That will be good, but we need to watch for a few things:
> 
> * We must make sure not to use it near the beginning or end of operands.
> 
> * What if the code runs on a machine with 128-byte lines?  Will it
>   risk clobbering the area just outside our operands?  Or is the ASI
>   defined to mean "64 byte cache line"?
> 
> * We need to watch out for overlapping copies, mpn_copyi(p,p+off,n)
>   where off might be any number >= 0.

All of these issues are easy to address.

The ASI is defined on 64-byte cache lines.

Overlapping is not an issue, since my 64-bytes-at-a-time loop loads
everything first before doing any stores.  It behaves no differently
from the generic 64-bit copyi/copyd we have on sparc64.  Each
loop iteration only writes to a full, aligned, 64-byte block of
memory.
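
To make the overlap argument concrete, here is a minimal C sketch of
the loop structure described above.  It is not the actual sparc
assembly: copyi_sketch and store_line_init are hypothetical names, and
the cache-initializing store is only modelled by zeroing the line
before the data is written, which is an assumption about how the
hardware behaves as seen from software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LIMBS_PER_LINE 8   /* 64-byte line / 8-byte limb */

/* Hypothetical model of a cache-initializing store of one full line:
   the line is cleared first, then the eight limbs are written. */
static void store_line_init(uint64_t *dst, const uint64_t *regs)
{
    assert(((uintptr_t)dst & 63) == 0);   /* full aligned lines only */
    memset(dst, 0, LIMBS_PER_LINE * sizeof(uint64_t));
    for (int i = 0; i < LIMBS_PER_LINE; i++)
        dst[i] = regs[i];
}

/* Increasing copy; dst <= src may overlap, as in mpn_copyi(p,p+off,n). */
static void copyi_sketch(uint64_t *dst, const uint64_t *src, size_t n)
{
    /* Head: ordinary stores until dst reaches a 64-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 63) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: load all eight limbs first, then store the whole line. */
    while (n >= LIMBS_PER_LINE) {
        uint64_t regs[LIMBS_PER_LINE];
        for (int i = 0; i < LIMBS_PER_LINE; i++)
            regs[i] = src[i];
        store_line_init(dst, regs);
        dst += LIMBS_PER_LINE;
        src += LIMBS_PER_LINE;
        n -= LIMBS_PER_LINE;
    }
    /* Tail: ordinary stores, so nothing past the operand is touched. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Because the eight loads complete before the line is cleared, a source
that starts inside the destination line (off < 8) is still read
intact, and the head and tail loops keep the special stores away from
the operand ends.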

But like I said, I think this facility has very limited applicability
here.

> The T3 popcount and hamdist timing numbers are awful.
> Is the C code perhaps faster?

The C code won't be faster.  It's slow on T3 because popc, like
multiplies, simply isn't pipelined at all.

I looked into this when I started using popc in some Linux kernel
library routines, and C code only wins for population counts on
16-bit and smaller quantities.
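
For reference, the kind of C code being compared against is
presumably something like the classic bit-parallel count below.  This
is only a sketch under that assumption; popcount64_sketch and
hamdist_sketch are hypothetical names, not the GMP or Linux kernel
routines.  The final fold uses shifts and adds rather than a multiply,
since multiplies are not pipelined on T3 either.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-parallel population count of one 64-bit limb. */
static unsigned popcount64_sketch(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);          /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
        + ((x >> 2) & 0x3333333333333333ULL);            /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;          /* 8-bit sums */
    x += x >> 8;                                         /* fold bytes */
    x += x >> 16;
    x += x >> 32;
    return (unsigned)(x & 0x7f);                         /* 0..64 */
}

/* Hamming distance of two n-limb operands: popcount of the xor. */
static uint64_t hamdist_sketch(const uint64_t *ap, const uint64_t *bp,
                               size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount64_sketch(ap[i] ^ bp[i]);
    return total;
}
```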

> About [lr]shift{c,}:
> 
> I think we should put generic code at the top-level dir, and make it
> work OK for T1 too.  That code should not rely on out-of-order
> execution.
> 
> lshiftc: Why both 'xnor' and 'andn'?  Is there no 'nor' insn?  Then
> I'd suggest xnor (or some pseudo insn for 'not') and 'or' for
> blending.

I've found xnor to be the canonical way to invert bits on sparc.  I can
use the pseudo-op if you want.
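
As an illustration of the operation under discussion, here is a C
sketch of a left shift that stores the complemented result.  It
follows my understanding of the mpn_lshiftc semantics (the return
value is the bits shifted out, not complemented); lshiftc_sketch is a
hypothetical name and the 64-bit limb size is an assumption.  The
blend is written as ~a & ~b, which is one reading of how an xnor
(used as 'not') followed by an andn can stand in for a missing 'nor',
since ~a & ~b equals ~(a | b).

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up,n} left by cnt bits and store the complement at {rp,n}.
   Requires n >= 1 and 1 <= cnt <= 63.  Works from the high limb down,
   so rp may equal up. */
static uint64_t lshiftc_sketch(uint64_t *rp, const uint64_t *up,
                               size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;
    uint64_t high = up[n - 1];
    uint64_t retval = high >> tnc;      /* bits shifted out, as-is */

    for (size_t i = n - 1; i > 0; i--) {
        uint64_t low = up[i - 1];
        uint64_t a = high << cnt;       /* upper part of result limb */
        uint64_t b = low >> tnc;        /* bits carried in from below */
        rp[i] = ~a & ~b;                /* not + andn, same as ~(a | b) */
        high = low;
    }
    rp[0] = ~(high << cnt);             /* low limb: zeros shifted in */
    return retval;
}
```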

> Have you tried software pipelined loops, 2-way or 4-way unrolled,
> instead of your 2-way non-pipelined loops?

Sure, I can look into this.