neon logops

Fri Mar 8 18:44:02 CET 2013

On 2013-03-08 03:46, Torbjorn Granlund wrote:
> I assume you mean that the destination ptr are naturally aligned, while
> the source ptrs are 32-bit aligned?

Yes.

> My guess for the "jaggyness" is that of two src ptrs, you rarely strike
> a case where they are 256-bit aligned, in particular not when both are
> 256-bit aligned.  But that happens much more often for 128-bit
> alignment.  My copy was alignment insensitive, perhaps thanks to
> scheduling, or that it stresses the unaligned load logic less, with its
> one load-per-store?

I don't know.  I do know there's something bizarre going on that probably
needs some chip knowledge to figure out.
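
To put rough numbers on the alignment guess quoted above, here is a small C
sketch.  It assumes (my assumption, not something measured) that source
offsets are uniformly distributed over the 32-bit-aligned positions within a
256-bit window, and that the two source pointers are independent.  The names
BASE_ALIGN, WINDOW, aligned_offsets, total_offsets and aligned_pairs are
made up for illustration.

```c
#include <assert.h>

/* Assumed model: pointers are guaranteed 32-bit (4-byte) aligned, and we
   look at their offset within a 256-bit (32-byte) window. */
enum { BASE_ALIGN = 4, WINDOW = 32 };

/* Number of candidate offsets per pointer within the window. */
static unsigned total_offsets(void)
{
    return WINDOW / BASE_ALIGN;
}

/* Number of BASE_ALIGN-aligned offsets in [0, WINDOW) that are also
   multiples of want_bytes. */
static unsigned aligned_offsets(unsigned want_bytes)
{
    unsigned hits = 0;
    assert(want_bytes != 0);
    for (unsigned off = 0; off < WINDOW; off += BASE_ALIGN)
        if (off % want_bytes == 0)
            hits++;
    return hits;
}

/* Number of (src1, src2) offset pairs, out of total_offsets() squared,
   in which both pointers are aligned to want_bytes.  Assumes the two
   offsets are independent. */
static unsigned aligned_pairs(unsigned want_bytes)
{
    unsigned n = aligned_offsets(want_bytes);
    return n * n;
}
```

Under these assumptions a single pointer is 128-bit aligned in 2 of 8 cases
and 256-bit aligned in 1 of 8, so both pointers are 128-bit aligned in 4 of
64 cases but 256-bit aligned in only 1 of 64.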

For instance, testing the -128 patch I posted here, and making no other change 
except *adding* :128 markers to both source operands, I hoped to determine what 
effect source alignment has on the loop.  (This change is not correct in
general, but it does work when running speed with a specified alignment.)
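
As a sketch of why the hint is not generally correct: a :128 qualifier tells
the load that the address is 16-byte aligned, and a load carrying the
qualifier is expected to fault when it is not.  C code wanting to select such
a path would need a guard along the following lines.  The helper names
is_aligned and can_use_128_hint are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if p is a multiple of align_bytes.
   align_bytes must be a nonzero power of two. */
static int is_aligned(const void *p, size_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    return ((uintptr_t)p & (align_bytes - 1)) == 0;
}

/* Hypothetical guard: a loop whose loads carry :128 hints on both source
   operands is only safe when both sources are 16-byte aligned. */
static int can_use_128_hint(const void *src1, const void *src2)
{
    return is_aligned(src1, 16) && is_aligned(src2, 16);
}
```

With sources that are only guaranteed 32-bit aligned, this guard fails for
most inputs, which is why the hinted loop is only usable in a test run with
forced alignment.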

The peak result is slightly *slower* than before.

                   with align                 without align
             mpn_and_n    mpn_nand_n     mpn_and_n    mpn_nand_n
10            #1.7989        1.8987        1.7990        1.8989
50            #0.9393        1.0693        0.9395        1.0694
100           #1.2491        1.3891        1.2496        1.3893
500           #0.8154        0.9753        0.8156        0.9756
1000           0.8746        1.0642       #0.7787        0.9435
5000          #1.4067        1.4939        1.5012        1.5577
10000         #1.5454        1.6702        1.5521        1.5926