ARM Neon popcount

Torbjorn Granlund tg at
Thu Feb 28 10:02:22 CET 2013

nisse at (Niels Möller) writes:

  What about vldm? Like
  	vldm	up!, {q0,q1,q2,q3}
  As far as I understand the manual, it supports a larger number of
  registers. The registers must be consecutive, but that's no problem

I added a long list of things to try.  A new version is attached.

I consider you to be the A9 optimisation guy, and hope that you will
tweak all great A15 loops into running wonderfully also on A9.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm-popcount.asm
Type: application/octet-stream
Size: 3289 bytes
Desc: not available
-------------- next part --------------

A trick worth exploring, which can make the code much simpler and also
give a high degree of alignment, is this:

Align pointers by masking, to (say) a 128-bit/16-byte boundary.  Load from
there, reading outside the defined area (unless the pointer was already
aligned).  Mask off the bytes read from before the operand, using
(ptr mod 16).  Then do any feed-in, looping, and wind-down.  Repeat the
trick at the end of the operand.  It will be simplest if we arrange for
the looping to leave 1-16 bytes, not 0-15 bytes, so that we don't need to
add conditions.
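
As a rough illustration of the steps above, here is a minimal sketch in
plain C rather than Neon assembly.  The names popcount_bytes and
block_popcount are hypothetical and are not taken from the attached
arm-popcount.asm.  A byte loop stands in for the 128-bit load and the
vector mask.  The sketch reads all 16 bytes of the first and last aligned
blocks, so it is only well behaved when those bytes belong to the same
object as the operand, for example a 16-byte aligned buffer padded to a
multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one aligned 16-byte block, keeping only the
   bytes with index in [lo, hi).  All 16 bytes are read; the unwanted
   ones are masked to zero, as a vector AND would do. */
static unsigned block_popcount(const unsigned char *blk,
                               unsigned lo, unsigned hi)
{
    unsigned count = 0;
    assert(lo <= hi && hi <= 16);
    for (unsigned i = 0; i < 16; i++) {
        unsigned char mask = (i >= lo && i < hi) ? 0xFF : 0x00;
        unsigned char b = blk[i] & mask;
        while (b) {             /* clear the lowest set bit each round */
            b &= (unsigned char)(b - 1);
            count++;
        }
    }
    return count;
}

/* Popcount of the n bytes starting at ptr (hypothetical interface). */
unsigned long popcount_bytes(const unsigned char *ptr, size_t n)
{
    if (n == 0)
        return 0;

    uintptr_t addr = (uintptr_t)ptr;
    unsigned head = (unsigned)(addr & 15);      /* ptr mod 16 */
    const unsigned char *blk = ptr - head;      /* aligned by masking */
    size_t left = n + head;     /* bytes from blk to end of operand */
    unsigned lo = head;         /* bytes to mask off in the first block */
    unsigned long count = 0;

    /* Loop while more than 16 bytes remain, so the wind-down always
       sees 1-16 bytes and needs no special case for zero. */
    while (left > 16) {
        count += block_popcount(blk, lo, 16);
        lo = 0;
        blk += 16;
        left -= 16;
    }

    /* Wind-down: the last block, masked at the end of the operand
       (and also at the start if the operand fits in one block). */
    count += block_popcount(blk, lo, (unsigned)left);
    return count;
}
```

Because the loop condition is left > 16 rather than left >= 16, the final
masked block always handles between 1 and 16 bytes, so the code after the
loop runs unconditionally.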

We get no love from Valgrind using these tricks, and Vincent will
lecture us that doing such things is undefined in C.  My
defence against the former is that Valgrind has an option for tolerating
our behaviour, but I must concede that our assembly code is indeed
completely undefined when interpreted as C.  :-)

