[PATCH] T3/T4 sparc shifts, plus more timings

Torbjorn Granlund tg at gmplib.org
Tue Mar 26 21:29:38 CET 2013


David Miller <davem at davemloft.net> writes:

  >   These give a modest speedup compared to the T1 routines.
  >   I also added missing T3 timings to existing code.
  >   
  > The first thing to try then is finding code that runs well on both.
  > There is a cost in having more variants than we need.
  
  It runs well on both, it's scheduled for T4 but for T3 what matters
  is hitting the load latency and pure number of operations.
  
I worry about T1 too, to keep the number of code variants under control.

  The ASI is defined on 64-byte cache lines.
  
  Overlapping is not an issue, since my 64-bytes-at-a-time loop loads
  everything first before doing any stores.  It has no difference in
  behavior to the generic 64-bit copyi/copyd we have on sparc64.  Each
  loop iteration only writes to a full, aligned, 64-byte block of
  memory.
  
I see.  A bulky structure, but that's necessary in this case.  Avoiding
a load cache miss for write operations will save a lot of memory cycles.
This helps both with latency and actual memory bandwidth.
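
A minimal C sketch of the loop structure described in the quoted text,
where each iteration loads a full 64-byte block before storing any of it.
The name copy_blocks64 is hypothetical, and ordinary C stores stand in for
the store ASI, which portable C cannot express.  A compiler is also free
to schedule this differently from hand-written assembly, so treat it as an
illustration of the ordering only.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n_blocks blocks of 64 bytes (8 limbs of 64 bits) from src to dst.
   Within each block, all eight loads come before any of the stores.
   Hypothetical helper, not GMP's actual copyi/copyd code. */
void copy_blocks64(uint64_t *dst, const uint64_t *src, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++) {
        /* Load phase: read the whole block into temporaries. */
        uint64_t t0 = src[0], t1 = src[1], t2 = src[2], t3 = src[3];
        uint64_t t4 = src[4], t5 = src[5], t6 = src[6], t7 = src[7];

        /* Store phase: write one full 64-byte block. */
        dst[0] = t0; dst[1] = t1; dst[2] = t2; dst[3] = t3;
        dst[4] = t4; dst[5] = t5; dst[6] = t6; dst[7] = t7;

        src += 8;
        dst += 8;
    }
}
```

In the real routine dst would also be 64-byte aligned at this point; the
sketch does not check that.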

If one is really crazy, it might actually help to perform a dummy read
operation of, say, 4 Kibyte of the source operand.  This will keep DRAM
busy, putting data in L1 cache.  Then we do the copy, using the ASI you
mentioned for writing; the loads will hit cache since we put the data there.

Why is this beneficial?  Because DRAM needs many cycles for direction
change; it wants to either get data for a long time or put data for a
long time.
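
A rough C sketch of the read-then-write idea above, assuming 64-bit limbs,
64-byte cache lines, and the 4 Kibyte chunk size suggested in the text.
The name copy_with_preread and both constants are assumptions for
illustration.  Plain stores stand in for the ASI stores, so this shows
only the two-phase structure, not the cache behaviour.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_LIMBS 512  /* 4 Kibyte of 64-bit limbs (assumed chunk size) */
#define LINE_LIMBS  8    /* one 64-byte cache line (assumed line size) */

/* Copy n limbs from src to dst in chunks: first a dummy read pass over
   the chunk, then the copy itself.  Hypothetical helper. */
void copy_with_preread(uint64_t *dst, const uint64_t *src, size_t n)
{
    while (n > 0) {
        size_t m = n < CHUNK_LIMBS ? n : CHUNK_LIMBS;

        /* Dummy read pass: touch one limb per cache line.  The volatile
           qualifier keeps the compiler from deleting the unused loads. */
        const volatile uint64_t *p = src;
        for (size_t i = 0; i < m; i += LINE_LIMBS)
            (void)p[i];

        /* Copy pass: the loads should now hit cache. */
        for (size_t i = 0; i < m; i++)
            dst[i] = src[i];

        src += m;
        dst += m;
        n -= m;
    }
}
```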

  But like I said I think this facility has much more limited
  applicability here.
  
Yep, I suppose this is overkill in GMP, but such code should be put in
glibc for each and every machine...

  > The T3 popcount and hamdist timing numbers are awful.
  > Is the C code perhaps faster?
  
  The C code won't be faster.  It's slow on T3 because popc, like
  multiplies, simply isn't pipelined at all.
  
The C code stays away from popc, and instead uses bit tricks.
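
For reference, one common form of such bit tricks, sketched in C.  This is
not necessarily the exact code in GMP; the function names are hypothetical.
The final summation uses shifts and adds rather than a multiply, since the
quoted text says multiplies are not pipelined on T3 either.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of x with parallel partial sums, no popc and no
   multiply. */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);               /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;           /* 8-bit sums */
    x += x >> 8;                                          /* fold bytes */
    x += x >> 16;
    x += x >> 32;
    return (unsigned)(x & 0xff);                          /* low byte = total */
}

/* Population count of an n-limb operand. */
unsigned long popcount_limbs(const uint64_t *p, size_t n)
{
    unsigned long cnt = 0;
    for (size_t i = 0; i < n; i++)
        cnt += popcount64(p[i]);
    return cnt;
}

/* Hamming distance of two n-limb operands: popcount of the xor. */
unsigned long hamdist_limbs(const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned long cnt = 0;
    for (size_t i = 0; i < n; i++)
        cnt += popcount64(a[i] ^ b[i]);
    return cnt;
}
```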

  I've found xnor to be the canonical way to invert bits on sparc, I can
  use the pseudo-op if you want.
  
It looked strange with andn + xnor.  I don't care much about canonical
vs pseudo-op.
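
A small C model of the bitwise semantics involved, for readers less
familiar with sparc.  The helper names are made up for illustration.  The
"not" pseudo-op is xnor with %g0 (the always-zero register) as the second
source.

```c
#include <stdint.h>

/* Model of sparc "xnor rs1, rs2, rd": rd = ~(rs1 ^ rs2). */
static inline uint64_t xnor64(uint64_t a, uint64_t b)
{
    return ~(a ^ b);
}

/* Model of sparc "andn rs1, rs2, rd": rd = rs1 & ~rs2. */
static inline uint64_t andn64(uint64_t a, uint64_t b)
{
    return a & ~b;
}

/* Model of the "not" pseudo-op: xnor against zero inverts every bit. */
static inline uint64_t not64(uint64_t a)
{
    return xnor64(a, 0);
}
```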

-- 
Torbjörn

