neon logops

Fri Mar 8 12:46:58 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  Building on the copyi that tege committed the other day, use neon for
  the logical operations too.

  I did both a 128-bit aligned version,

  > $ ./speed-128 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
  > clock_gettime is 1.000ns accurate
  > overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
  >             mpn_and_n    mpn_nand_n
  > 10            #1.7987        1.8986
  > 50            #0.9393        1.0692
  > 100           #1.2491        1.3890
  > 500           #0.8154        0.9753
  > 1000          #0.7786        0.9435
  > 5000          #1.4955        1.5765
  > 10000         #1.6532        1.7415

  and a 256-bit aligned version, just to see if having a higher ratio of
  operation insns to memory insns would help,

  > $ ./speed-256 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
  > clock_gettime is 1.000ns accurate
  > overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
  >             mpn_and_n    mpn_nand_n
  > 10            #1.5989        1.6988
  > 50            #1.0992        1.1592
  > 100           #1.0393        1.0593
  > 500           #1.0373        1.0413
  > 1000          #1.0303        1.0313
  > 5000          #1.5914        1.6003
  > 10000          1.6824       #1.6768

  It's a bit curious how the latter is less "jaggy", but slightly slower.

I assume you mean that the destination ptrs are naturally aligned, while
the source ptrs are 32-bit aligned?

My guess for the "jagginess" is that with two src ptrs, you rarely hit
a case where either is 256-bit aligned, and even more rarely one where
both are.  That happens much more often for 128-bit alignment.  My copy
was alignment insensitive, perhaps thanks to scheduling, or because it
stresses the unaligned load logic less, with its one load per store?

You can play with -x -y -w -W to force alignment.  They are for src1,
src2, dst1, dst2, respectively, IIRC.  0 would mean "aligned", except
that's not too well-defined.  1 means the pointer mod 2^something = 1,
etc.

-- 
Torbjörn