ARM Neon popcount
Torbjorn Granlund
tg at gmplib.org
Thu Feb 28 10:02:22 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
> What about vldm? Like
>
>   vldm up!, {q0,q1,q2,q3}
>
> As far as I understand the manual, it supports a larger number of
> registers. The registers must be consecutive, but that's no problem
> here.
I added a long list of things to try. A new version is attached.
I consider you to be the A9 optimisation guy, and hope that you will
tweak all great A15 loops into running wonderfully also on A9.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm-popcount.asm
Type: application/octet-stream
Size: 3289 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130228/0893ae21/attachment-0001.obj>
-------------- next part --------------
A trick to explore, which can make the code much simpler and also give
a high degree of alignment, is this:
Align pointers by masking, to (say) a 128-bit/16-byte boundary. Load from
there, reading outside the defined area (unless the pointer was already
aligned). Mask the read data using (ptr mod 16). Then do any feed-in,
looping, and wind-down. Repeat the trick at the end of the operand. It
will be simplest if we arrange for the looping to leave 1-16 bytes, not
0-15 bytes, so that we don't need to add conditions.
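
To make the idea concrete, here is a minimal C sketch of the
align-down, over-read, and mask scheme. It is an illustration only, not
the attached assembly: the names popcount_aligned, block_popcount and
byte_popcount are made up for this example, the 16-byte memcpy merely
stands in for one aligned 128-bit Neon load, and the per-byte counting
loop stands in for a vector popcount instruction. It assumes the
operand sits inside a buffer padded to 16-byte boundaries, so that the
over-read stays within allocated memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: popcount of one byte by clearing the lowest set bit. */
static unsigned byte_popcount(uint8_t b)
{
    unsigned c = 0;
    while (b) {
        b &= (uint8_t)(b - 1);
        c++;
    }
    return c;
}

/* Count the bits of one 16-byte block, keeping only byte indices in
   [lo, hi).  The memcpy stands in for a single aligned 128-bit load. */
static unsigned long block_popcount(const uint8_t *blk, unsigned lo, unsigned hi)
{
    uint8_t tmp[16];
    unsigned long c = 0;
    memcpy(tmp, blk, 16);
    for (unsigned i = 0; i < 16; i++) {
        uint8_t mask = (i >= lo && i < hi) ? 0xff : 0x00;
        c += byte_popcount(tmp[i] & mask);
    }
    return c;
}

/* Popcount of the n bytes at p using only 16-byte-aligned block reads.
   Assumption: the aligned blocks covering [p, p+n) are readable. */
unsigned long popcount_aligned(const uint8_t *p, size_t n)
{
    if (n == 0)
        return 0;

    uintptr_t start = (uintptr_t)p;
    uintptr_t end = start + n;
    uintptr_t first = start & ~(uintptr_t)15;     /* pointer aligned down */
    uintptr_t last = (end - 1) & ~(uintptr_t)15;  /* block holding the last byte */
    unsigned head = (unsigned)(start - first);    /* ptr mod 16 */
    unsigned tail = (unsigned)(end - last);       /* 1..16 valid bytes, never 0 */

    if (first == last)  /* whole operand lies inside a single block */
        return block_popcount((const uint8_t *)first, head, tail);

    unsigned long c = block_popcount((const uint8_t *)first, head, 16);
    for (uintptr_t a = first + 16; a < last; a += 16)
        c += block_popcount((const uint8_t *)a, 0, 16);
    c += block_popcount((const uint8_t *)last, 0, tail);
    return c;
}
```

Because the last block is taken to be the one holding the final byte,
the wind-down always sees 1-16 valid bytes, so no separate "nothing
left" case is needed there.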
We get no love from Valgrind using these tricks, and Vincent will
lecture us that this is undefined in C when doing such things. My
defence against the former is that Valgrind has an option for tolerating
our behaviour, but I must concede that our assembly code is indeed
completely undefined when interpreted as C. :-)
--
Torbjörn