ARM Neon popcount

Richard Henderson rth at
Thu Feb 28 00:08:26 CET 2013

On 2013-02-27 14:33, Torbjorn Granlund wrote:
>    	vld1.32		{ q1, q2 }, [r0@128]!
>    As specified in section A.3.2.1, if you specify the alignment it will
>    also be checked, so you'll get SIGBUS if it's not right.
> I wanted to experiment, but I cannot find any syntax which is accepted
> by gas.  @128 does not work (in gas 2.22).

I had to use the disassembler to figure it out.  Gas uses a colon.

	vld1.64	{d0-d3}, [r0:128]

That isn't obvious, but I should have figured it had to be something else, since
"@" begins a comment in ARM assembly.

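The alignment in the colon syntax is given in bits, so [r0:128] promises a 16-byte aligned address. As a minimal sketch in portable C (not part of the original code; the helper name meets_alignment_hint is hypothetical), this is the condition a caller would have to guarantee before using the hinted form:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p satisfies an alignment hint of
   align_bits bits, e.g. 128 for [r0:128]. */
static int meets_alignment_hint(const void *p, unsigned align_bits)
{
    /* The hint is in bits; it must be a power of two of at least one byte. */
    assert(align_bits >= 8 && (align_bits & (align_bits - 1)) == 0);

    /* Convert bits to bytes and test the low address bits. */
    uintptr_t mask = (uintptr_t)(align_bits / 8) - 1;
    return ((uintptr_t)p & mask) == 0;
}
```

A pointer for which meets_alignment_hint(p, 128) is false would be expected to fault with the :128 form, so a routine taking arbitrary pointers would need either an unhinted load or a prologue that reaches alignment first.
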
And I lied about not being able to read four 128-bit registers in one insn.
You can't do it with VLD[1-4], but you can with VLDM.

Something else to look at is whether VLDR and VLDM perform better than VLD[1-4] on A9.

The one thing that you do have to worry about there is that VLD[1-4] load 
consecutive "elements" as defined by the data type, whereas VLD[RM] load full 
64-bit registers.  This distinction matters in big-endian mode.

Of course, the big-endian caveat doesn't apply to popcount.
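
The reason is that a population count depends only on which bits are set, not on where they sit in the register. A rough sketch in portable C, modelling the reordering as a plain byte reversal (an assumption for illustration; the function names are hypothetical and this is not the Neon code under discussion):

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Reverse the eight bytes of x; stands in for a register whose
   contents arrived in a different order. */
static uint64_t byte_reverse64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* The count is the same whichever order the bytes were loaded in. */
static void check_order_independent(uint64_t x)
{
    assert(popcount64(x) == popcount64(byte_reverse64(x)));
}
```

The same argument would not hold for an operation whose result depends on bit position, which is where the VLD[1-4] versus VLD[RM] distinction starts to matter.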

>    It trades 1 vpaddl for two add insns, but the total latency is
>    probably a cycle or two better since we're now operating in core.
> Need to test that, I think.  I fear the corereg<->vreg bandwidth might
> be poor.

If it's awful, one could perform the final fold with "vadd.i64 d16, d17" and 
perform only one move to r0, swallowing that latency in the function return.
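
To make the trade-off concrete, here is a small model in portable C of the two ways to finish. It assumes the partial counts sit in two 64-bit lanes and that the result is returned as a 32-bit value in r0; the function names are hypothetical and the exact core-side sequence is a guess.

```c
#include <assert.h>
#include <stdint.h>

/* Vector-side fold: add the two 64-bit lanes (as "vadd.i64" would),
   then move only the low word to a core register. One move. */
static uint32_t fold_in_vector(uint64_t lane0, uint64_t lane1)
{
    uint64_t sum = lane0 + lane1;
    return (uint32_t)sum;
}

/* Core-side fold: move the low word of each lane to a core register,
   then add there. Two moves. */
static uint32_t fold_in_core(uint64_t lane0, uint64_t lane1)
{
    uint32_t a = (uint32_t)lane0;
    uint32_t b = (uint32_t)lane1;
    return a + b;
}

/* Both orders give the same 32-bit result, so the choice is purely
   about latency and transfer bandwidth. */
static void check_folds_agree(uint64_t lane0, uint64_t lane1)
{
    assert(fold_in_vector(lane0, lane1) == fold_in_core(lane0, lane1));
}
```

Since truncation to 32 bits commutes with addition, the two variants differ only in how many vreg-to-corereg transfers they need.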

