ARM Neon popcount

Wed Feb 27 23:33:06 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  > * I completely ignore alignment.  Is that bad?

  I'm not sure about that.  It's something that perhaps we should
  experiment with.  As written, the code will work, as the chip will
  handle totally unaligned data.  What I don't know is whether
  *specifying* increased alignment in the insn helps.  E.g.

  	vld1.32		{ q1, q2 }, [r0@128]!

  As specified in section A.3.2.1, if you specify the alignment it will
  also be checked, so you'll get SIGBUS if it's not right.

I wanted to experiment, but I cannot find any syntax which is accepted
by gas.  @128 does not work (in gas 2.22).

I cannot see any A15 performance difference with the current code when
I vary the alignment of the operand.

It is possible that we always pay a price for using the current
unaligned insn form.
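
If an alignment-qualified load turns out to help, the code would have to
guarantee the alignment before using it, since a wrong hint gives SIGBUS.
A minimal C sketch of that check follows.  The helper names are
hypothetical and not part of the posted code, and it assumes 32-bit limbs
whose pointer is at least 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 128-bit (16-byte) boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: number of 32-bit limbs to handle separately
   before up reaches a 16-byte boundary.
   Assumption: up is at least 4-byte aligned. */
static size_t limbs_until_aligned16(const uint32_t *up)
{
    assert(((uintptr_t)up & 3) == 0);
    return ((16 - ((uintptr_t)up & 15)) & 15) / sizeof(uint32_t);
}
```

A prologue could process limbs_until_aligned16(up) limbs one at a time and
only then enter a loop that uses the aligned form.  Whether that extra
work pays off is exactly what would need measuring.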

(On A9 the vld1 insns are clearly quite expensive; by removing them the
code runs at 1.6 c/l.)

  	vmov.i64	d0, #0
  	vld1.i32	{d0[0]}, [up]!

Thanks.

  > * Could one shave off an instruction in the final accumulation?  We don't
  >    really need 64-bit accumulators.

  How about:
  					C we have 8 16-bit counts
  L(e0):	vpaddl.u16	q8, q8		C we have 4 32-bit counts
  	vmov		r0, r1, d16
  	vmov		r2, r3, d17
  	add		r0, r0, r1
  	add		r2, r2, r3
  	add		r0, r0, r2

  It trades 1 vpaddl for two add insns, but the total latency is
  probably a cycle or two better since we're now operating in core.

Need to test that, I think.  I fear the corereg<->vreg bandwidth might
be poor.
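
For reference, here is a scalar C model of what the proposed tail
computes.  It only illustrates the arithmetic (8 16-bit counts, one
pairwise widening add, then three adds in core registers) and says
nothing about latency or transfer bandwidth.  The function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: c[] stands for the 8 16-bit counts in q8. */
static uint32_t reduce_counts(const uint16_t c[8])
{
    uint32_t w[4];
    int i;

    /* vpaddl.u16 q8, q8: add adjacent 16-bit lanes into 4 32-bit lanes */
    for (i = 0; i < 4; i++)
        w[i] = (uint32_t)c[2 * i] + c[2 * i + 1];

    /* vmov r0, r1, d16 / vmov r2, r3, d17, then the three core adds */
    uint32_t r0 = w[0] + w[1];
    uint32_t r2 = w[2] + w[3];
    return r0 + r2;
}

/* Usage example: lanes holding 1..8 sum to 36. */
static void reduce_counts_example(void)
{
    const uint16_t c[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    assert(reduce_counts(c) == 36);
}
```

Even with every lane at its maximum of 65535 the total fits easily in 32
bits, which is consistent with the remark that 64-bit accumulators are
not needed here.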

  > * Can one read four 128-bit values using just one insn (for inner loop)?

  No.  We can only read 4 64-bit values.  I didn't actually realize the
  assembler would accept Q registers in the <list> grammar non-terminal.

It makes the code a bit more readable, since we avoid aliasing.

-- 
Torbjörn