neon logops
Torbjorn Granlund
tg at gmplib.org
Fri Mar 8 12:46:58 CET 2013
Richard Henderson <rth at twiddle.net> writes:
Building on the copyi that tege committed the other day, use neon for
the logical operations too.
I did both a 128-bit aligned version,
> $ ./speed-128 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
> clock_gettime is 1.000ns accurate
> overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
>           mpn_and_n   mpn_nand_n
> 10          #1.7987       1.8986
> 50          #0.9393       1.0692
> 100         #1.2491       1.3890
> 500         #0.8154       0.9753
> 1000        #0.7786       0.9435
> 5000        #1.4955       1.5765
> 10000       #1.6532       1.7415
and a 256-bit aligned version, just to see if having a higher ratio of
operation insns to memory insns would help,
> $ ./speed-256 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
> clock_gettime is 1.000ns accurate
> overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
>           mpn_and_n   mpn_nand_n
> 10          #1.5989       1.6988
> 50          #1.0992       1.1592
> 100         #1.0393       1.0593
> 500         #1.0373       1.0413
> 1000        #1.0303       1.0313
> 5000        #1.5914       1.6003
> 10000        1.6824      #1.6768
It's a bit curious how the latter is less "jaggy", but slightly slower.
I assume you mean that the destination ptr is naturally aligned, while
the source ptrs are 32-bit aligned?
My guess for the "jagginess" is that with two src ptrs, you rarely hit
a case where either is 256-bit aligned, and even more rarely one where
both are. But that happens much more often for 128-bit alignment. My
copy was alignment insensitive, perhaps thanks to scheduling, or because
it stresses the unaligned load logic less, with its one load per store?
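
To put rough numbers on that guess, here is a minimal sketch that counts
how many combinations of two source offsets leave both pointers aligned.
It assumes 4-byte limbs (pointers only guaranteed 32-bit aligned) and
that every offset is equally likely, neither of which was measured here.
The helper name both_aligned_pairs is made up for illustration and is
not part of GMP.

```c
#include <assert.h>

/* Hypothetical helper, not GMP code.  Each source pointer can sit at any
   multiple of limb_bytes within an align_bytes block.  Returns how many
   (src1, src2) offset pairs have BOTH pointers on an align_bytes
   boundary, and stores the total number of pairs in *total_pairs. */
static unsigned both_aligned_pairs(unsigned align_bytes, unsigned limb_bytes,
                                   unsigned *total_pairs)
{
    assert(limb_bytes != 0 && align_bytes % limb_bytes == 0);

    unsigned slots = align_bytes / limb_bytes; /* offsets per pointer */
    unsigned hits = 0, total = 0;

    for (unsigned a = 0; a < slots; a++) {
        for (unsigned b = 0; b < slots; b++) {
            total++;
            /* A pointer is aligned only when its byte offset is 0. */
            if ((a * limb_bytes) % align_bytes == 0 &&
                (b * limb_bytes) % align_bytes == 0)
                hits++;
        }
    }
    *total_pairs = total;
    return hits;
}
```

Under those assumptions, both_aligned_pairs(16, 4, &t) gives 1 hit out
of t = 16 pairs for 128-bit alignment, and both_aligned_pairs(32, 4, &t)
gives 1 out of t = 64 for 256-bit alignment, so the fully aligned case
would be four times rarer with the wider requirement.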
You can play with -x -y -w -W to force alignment. They are for src1,
src2, dst1, dst2, respectively, IIRC. 0 would mean "aligned", except
that's not too well-defined. 1 means the pointer mod 2^something = 1,
etc.
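
As an illustration of what "pointer mod 2^something = 1" can mean, here
is a minimal sketch that takes a 32-byte aligned pool and hands out
pointers at a chosen limb offset within a 32-byte block. The limb size
(4 bytes), the block size (32 bytes) and the names pool, pool_at and
limb_offset_in_block are all assumptions for the example; the unit and
modulus that the speed program really uses are not stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;            /* assumption: 32-bit limbs */

/* Assumption: a 32-byte (256-bit) aligned pool of limbs to carve
   test operands from.  _Alignas requires C11. */
static _Alignas(32) limb_t pool[64];

/* Hypothetical helper: pointer that sits offset_limbs limbs past an
   aligned boundary, so offset 0 is "aligned" and 1 is one limb off. */
static limb_t *pool_at(size_t offset_limbs)
{
    assert(offset_limbs < sizeof pool / sizeof pool[0]);
    return pool + offset_limbs;
}

/* Position of p inside its block_bytes block, counted in limbs. */
static size_t limb_offset_in_block(const void *p, size_t block_bytes)
{
    return (size_t)((uintptr_t)p % block_bytes) / sizeof(limb_t);
}
```

With these definitions, limb_offset_in_block(pool_at(1), 32) is 1 and
limb_offset_in_block(pool_at(8), 32) wraps back to 0, which is one way
to see why "0 means aligned" depends on the modulus chosen.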
--
Torbjörn