[PATCH] T3/T4 sparc shifts, plus more timings
davem at davemloft.net
Tue Mar 26 21:39:03 CET 2013
From: Torbjorn Granlund <tg at gmplib.org>
Date: Tue, 26 Mar 2013 21:29:38 +0100
> David Miller <davem at davemloft.net> writes:
> The ASI is defined on 64-byte cache lines.
> Overlapping is not an issue, since my 64-bytes-at-a-time loop loads
> everything first before doing any stores. It has no difference in
> behavior to the generic 64-bit copyi/copyd we have on sparc64. Each
> loop iteration only writes to a full, aligned, 64-byte block of
> I see. A bulky structure, but that's necessary in this case. Avoiding
> a load cache miss for write operations will save a lot of memory cycles.
> This helps with both latency and actual memory bandwidth.
> If one is really crazy, it might actually help to perform a dummy read
> operation of, say, 4 KiB of the source operand. This will keep DRAM
> busy, putting data in L1 cache. Then we do the copy, using the ASI you
> mentioned for writing; the loads will hit cache since we put them there.
We have a prefetch instruction on sparc which we could use for this.
And in fact that's what I do for memcpy on the various Niagara chips,
prefetch about 256 bytes ahead, and using cache line initializing
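
To make the prefetch-ahead idea concrete, here is a minimal C sketch. It is
not the Niagara memcpy nor GMP's copyi: the function name, the PREFETCH_LIMBS
constant and the use of the GCC/Clang builtin __builtin_prefetch are
assumptions made for illustration, and the cache-line-initializing stores are
left out because they need sparc assembly. It does keep the two properties
discussed above: each iteration handles one 64-byte block, and all loads of a
block are issued before any of its stores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: 256 bytes ahead, expressed in 8-byte limbs. */
#define PREFETCH_LIMBS 32

/* Illustrative incrementing copy of n 64-bit limbs (hypothetical name).
   Safe for overlapping operands when dst <= src, like a copyi. */
void copy_with_prefetch(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: one 64-byte block (8 limbs) per iteration. */
    for (; i + 8 <= n; i += 8) {
        /* Hint that the source data 256 bytes ahead will be read soon.
           Guarded so the address expression stays inside the operand. */
        if (i + PREFETCH_LIMBS < n)
            __builtin_prefetch(src + i + PREFETCH_LIMBS, 0, 0);

        /* Load the whole block first... */
        uint64_t a0 = src[i + 0], a1 = src[i + 1];
        uint64_t a2 = src[i + 2], a3 = src[i + 3];
        uint64_t a4 = src[i + 4], a5 = src[i + 5];
        uint64_t a6 = src[i + 6], a7 = src[i + 7];

        /* ...then store it, so a store never clobbers a pending load. */
        dst[i + 0] = a0; dst[i + 1] = a1;
        dst[i + 2] = a2; dst[i + 3] = a3;
        dst[i + 4] = a4; dst[i + 5] = a5;
        dst[i + 6] = a6; dst[i + 7] = a7;
    }

    /* Tail: fewer than 8 limbs remain. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

The sketch does not align dst to a 64-byte boundary; a real loop using
block-initializing stores would have to do that first.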
> > The T3 popcount and hamdist timing numbers are awful.
> > Is the C code perhaps faster?
> The C code won't be faster. It's slow on T3 because popc, like
> multiplies, simply isn't pipelined at all.
> The C code stays away from popc, and instead uses bit tricks.
I know, this was implied in this conversation. The bit tricks are
still more expensive than popc at least on T3.
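
For readers unfamiliar with the "bit tricks" in question, a minimal sketch of
the usual branch-free population count in C follows. It is not a copy of
GMP's generic code; the function names are hypothetical. The final reduction
uses shifts and adds rather than the common multiply by 0x0101...01, an
illustrative choice given that multiplies are also said above to be
unpipelined on T3.

```c
#include <stdint.h>

/* Hypothetical name: count the set bits of x without a popc instruction. */
unsigned popcount_bittricks(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x -= (x >> 1) & 0x5555555555555555ULL;
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes (each byte is at most 8). */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Fold the eight byte counts into the low byte with shifts and adds. */
    x += x >> 8;
    x += x >> 16;
    x += x >> 32;
    return (unsigned)(x & 0x7f);
}

/* Hypothetical name: Hamming distance of two limbs is popcount of the XOR. */
unsigned hamdist_bittricks(uint64_t a, uint64_t b)
{
    return popcount_bittricks(a ^ b);
}
```

That is roughly a dozen dependent ALU operations per limb, which is the cost
being compared against a single popc.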