[PATCH] T3/T4 sparc shifts, plus more timings

Tue Mar 26 21:39:03 CET 2013

From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 26 Mar 2013 21:29:38 +0100

> David Miller <davem at davemloft.net> writes:
> 
>   The ASI is defined on 64-byte cache lines.
>   
>   Overlapping is not an issue, since my 64-bytes-at-a-time loop loads
>   everything first before doing any stores.  It behaves no differently
>   from the generic 64-bit copyi/copyd we have on sparc64.  Each
>   loop iteration only writes to a full, aligned, 64-byte block of
>   memory.
>   
> I see.  A bulky structure, but that's necessary in this case.  Avoiding
> a load cache miss for write operations will save a lot of memory cycles.
> This helps both with latency and actual memory bandwidth.
> 
> If one is really crazy, it might actually help to perform a dummy read
> operation of, say, 4 KiB of the source operand.  This will keep DRAM
> busy, putting data in L1 cache.  Then we do the copy, using the ASI you
> mentioned for writing; the loads will hit cache since we put the data there.

We have a prefetch instruction on sparc which we could use for this.
And in fact that's what I do for memcpy on the various Niagara chips:
prefetch about 256 bytes ahead, and use cache-line-initializing
stores.
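
For illustration only, here is a rough C sketch of that loop shape: prefetch
the source a fixed distance ahead, load one whole 64-byte block, then store
it.  The names copy_blocks, LIMBS_PER_LINE and PREFETCH_LIMBS are made up for
this sketch, and __builtin_prefetch is the GCC/Clang builtin, assumed to be
available.  The cache-line-initializing store ASI cannot be expressed in
portable C, so plain stores stand in for it; this is not the actual assembly
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMBS_PER_LINE 8    /* one 64-byte cache line = 8 limbs of 8 bytes */
#define PREFETCH_LIMBS 32   /* 256 bytes ahead, expressed in limbs */

/* Copy n limbs from src to dst, one 64-byte block per iteration.
   Assumes n is a multiple of LIMBS_PER_LINE; alignment of dst to a
   64-byte boundary is left to the caller in this sketch.  */
void copy_blocks(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(n % LIMBS_PER_LINE == 0);

    for (size_t i = 0; i < n; i += LIMBS_PER_LINE) {
        /* Hint that the source block 256 bytes ahead will be read soon.
           Guarded so the address expression stays inside the array.  */
        if (i + PREFETCH_LIMBS < n)
            __builtin_prefetch(src + i + PREFETCH_LIMBS, 0, 0);

        /* Load the whole block before storing any of it.  */
        uint64_t t[LIMBS_PER_LINE];
        for (size_t j = 0; j < LIMBS_PER_LINE; j++)
            t[j] = src[i + j];

        /* Store the full block; the real code would use the
           block-initializing store here instead of plain stores.  */
        for (size_t j = 0; j < LIMBS_PER_LINE; j++)
            dst[i + j] = t[j];
    }
}
```

The 256-byte distance is the figure mentioned above; the best value on any
given chip would have to be measured.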

>   > The T3 popcount and hamdist timing numbers are awful.
>   > Is the C code perhaps faster?
>   
>   The C code won't be faster.  It's slow on T3 because popc, like
>   multiplies, simply isn't pipelined at all.
>   
> The C code stays away from popc, and instead uses bit tricks.

I know, this was implied in this conversation.  The bit tricks are
still more expensive than popc, at least on T3.
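
For readers not familiar with the bit tricks in question, a minimal sketch of
the well-known SWAR population count follows.  It is the textbook version,
not necessarily what the GMP C code does, and the function names
popcount64_swar, popcount_vec and hamdist_vec are invented for this example.
The point is only that each limb costs a dozen or so dependent ALU
operations, which is what a single popc instruction is competing against.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits of one 64-bit word without a popc instruction.  */
unsigned popcount64_swar(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields.  */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit fields into 4-bit fields.  */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes.  */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Add all bytes together; the total ends up in the top byte.  */
    return (unsigned) ((x * 0x0101010101010101ULL) >> 56);
}

/* Population count of an n-limb operand.  */
unsigned long popcount_vec(const uint64_t *p, long n)
{
    assert(n >= 0);
    unsigned long total = 0;
    for (long i = 0; i < n; i++)
        total += popcount64_swar(p[i]);
    return total;
}

/* Hamming distance of two n-limb operands: popcount of the XOR.  */
unsigned long hamdist_vec(const uint64_t *a, const uint64_t *b, long n)
{
    assert(n >= 0);
    unsigned long total = 0;
    for (long i = 0; i < n; i++)
        total += popcount64_swar(a[i] ^ b[i]);
    return total;
}
```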