ARM Neon popcount

Wed Feb 27 23:13:21 CET 2013

On 2013-02-27 13:27, Torbjorn Granlund wrote:
> Specific questions:
> * I completely ignore alignment.  Is that bad?

I'm not sure about that.  It's something that perhaps we should 
experiment with.  As written, the code will work, as the chip will 
handle totally unaligned data.  What I don't know is whether 
*specifying* increased alignment in the insn helps.  E.g.

	vld1.32		{ q1, q2 }, [r0@128]!

As specified in section A.3.2.1, if you specify the alignment it will 
also be checked, so you'll get SIGBUS if it's not right.
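
Since a wrong alignment hint faults instead of just running slower, any
experiment with it needs a check on the pointer first.  A rough C sketch
of that check follows.  Both helper names are made up for illustration and
are not part of the posted code, and the second helper assumes the pointer
is already 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is aligned to `bits` bits.
   `bits` must be a power of two and a multiple of 8.  */
static int
aligned_to_bits (const void *p, unsigned bits)
{
  uintptr_t mask = (uintptr_t) (bits / 8) - 1;
  return ((uintptr_t) p & mask) == 0;
}

/* Hypothetical helper: how many 32-bit limbs must be handled with
   plain (unhinted) loads before `up` reaches a 128-bit boundary.
   Assumes `up` is at least 4-byte aligned.  */
static size_t
limbs_until_aligned128 (const uint32_t *up)
{
  size_t bytes = (16 - ((uintptr_t) up & 15)) & 15;
  return bytes / 4;
}
```

A loop could peel off limbs_until_aligned128 (up) limbs and only then
switch to loads that carry the 128-bit hint.  Whether that is a win is
exactly what would need measuring.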

> * Can 32 bits be read to a dN register with zeroing of the other 32
>    bits?  (See comment "surely we can read...".)

No.  But you don't have to go through a core register as you did,
you can read directly into a single lane:

	vmov.i64	d0, #0
	vld1.i32	{d0[0]}, [up]!
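
The vmov is needed because a single-lane load leaves the other lane of
the register untouched.  A scalar C model of the two insns, as a sketch
only: the function names are invented here, and the D register is
modelled as a uint64_t with lane 0 in the low 32 bits.

```c
#include <stdint.h>

/* Model of  vld1.i32 {d0[0]}, [up]!  on its own: lane 0 (the low
   32 bits) is replaced, lane 1 keeps whatever was there before.
   Hypothetical name, for illustration only.  */
static uint64_t
load_lane0 (uint64_t d, const uint32_t **up)
{
  d = (d & ~(uint64_t) 0xffffffff) | **up;
  *up += 1;			/* the "!" post-increment */
  return d;
}

/* Model of the pair  vmov.i64 d0, #0 ; vld1.i32 {d0[0]}, [up]!  */
static uint64_t
load_lane0_zeroed (const uint32_t **up)
{
  uint64_t d0 = 0;		/* vmov.i64 d0, #0 */
  return load_lane0 (d0, up);
}
```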

> * Could one shave off an instruction in the final accumulation?  We don't
>    really need 64-bit accumulators.

How about:
					C we have 8 16-bit counts
L(e0):	vpaddl.u16	q8, q8		C we have 4 32-bit counts
	vmov		r0, r1, d16
	vmov		r2, r3, d17
	add		r0, r0, r1
	add		r2, r2, r3
	add		r0, r0, r2

It trades one vpaddl for two add insns, but the total latency is probably 
a cycle or two better, since the final adds now run in the core.
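
For checking the arithmetic, here is a scalar C model of that sequence.
It is only a sketch: the function names are invented, and q8 is modelled
as an array of eight 16-bit counts with d16 as the low half.

```c
#include <stdint.h>

/* Model of  vpaddl.u16 q8, q8 : add adjacent pairs of unsigned 16-bit
   lanes, widening each sum to 32 bits.  Hypothetical name.  */
static void
vpaddl_u16_model (const uint16_t in[8], uint32_t out[4])
{
  for (int i = 0; i < 4; i++)
    out[i] = (uint32_t) in[2 * i] + in[2 * i + 1];
}

/* Model of the whole tail: one vpaddl, two vmov, three add.  */
static uint32_t
final_sum_model (const uint16_t counts[8])
{
  uint32_t s[4];
  vpaddl_u16_model (counts, s);	/* 4 32-bit counts */

  uint32_t r0 = s[0], r1 = s[1];	/* vmov r0, r1, d16 */
  uint32_t r2 = s[2], r3 = s[3];	/* vmov r2, r3, d17 */
  r0 += r1;			/* add r0, r0, r1 */
  r2 += r3;			/* add r2, r2, r3 */
  r0 += r2;			/* add r0, r0, r2 */
  return r0;
}
```

The two middle adds are independent of each other, so only two of the
three adds sit on the critical path.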

> * Can one read four 128-bit values using just one insn (for inner loop)?

No.  One insn can read at most four 64-bit values.  I didn't actually 
realize the assembler would accept Q registers in the <list> grammar 
non-terminal.

r~