ARM Neon popcount
Richard Henderson
rth at twiddle.net
Wed Feb 27 23:13:21 CET 2013
On 2013-02-27 13:27, Torbjorn Granlund wrote:
> Specific questions:
> * I completely ignore alignment. Is that bad?
I'm not sure about that. It's something that perhaps we should
experiment with. As written, the code will work, as the chip will
handle totally unaligned data. What I don't know is whether
*specifying* increased alignment in the insn helps. E.g.
	vld1.32 { q1, q2 }, [r0@128]!
As specified in section A.3.2.1, if you specify the alignment it will
also be checked, so you'll get SIGBUS if it's not right.
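
For experimenting with the alignment hint, it may help to test the address before taking the hinted path. A minimal sketch in portable C (not NEON code): the helper name is_aligned_bits is hypothetical, and it only models the condition the hint implies, namely that the address is a multiple of the stated number of bits divided by 8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p sits on a `bits`-bit boundary
   (e.g. bits = 128 means a 16-byte boundary).
   Assumes `bits` is a power of two and at least 8. */
static inline bool is_aligned_bits(const void *p, unsigned bits)
{
    uintptr_t bytes = bits / 8;
    return ((uintptr_t)p & (bytes - 1)) == 0;
}
```

A caller could use the hinted load only when is_aligned_bits(up, 128) holds and fall back to the unhinted form otherwise. Whether that pays off is exactly the open question above.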
> * Can 32 bits be read to a dN register with zeroing of the other 32
> bits? (See comment "surely we can read...".)
No. But you don't have to go through a core register as you did,
you can read directly into a single lane:
	vmov.i64 d0, #0
	vld1.i32 {d0[0]}, [up]!
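
To make the intended effect of that pair concrete, here is a small portable C model (not NEON code). The type dreg_model and the function load_lane0_zeroed are hypothetical names for illustration; the model treats a D register as two 32-bit lanes and mimics the post-increment of the pointer.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of a 64-bit D register viewed as two 32-bit lanes. */
typedef struct { uint32_t lane[2]; } dreg_model;

/* Models the two-insn sequence: clear the whole register, load one
   32-bit word into lane 0, and advance the pointer by 4 bytes. */
static inline dreg_model load_lane0_zeroed(const unsigned char **up)
{
    dreg_model d = { {0, 0} };                  /* vmov.i64 d0, #0      */
    memcpy(&d.lane[0], *up, sizeof d.lane[0]);  /* load into d0[0]      */
    *up += sizeof d.lane[0];                    /* the "!" writeback    */
    return d;
}
```

Lane 1 stays zero only because the register was cleared first; the lane load by itself leaves the other lane untouched.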
> * Could one shave off an instruction in the final accumulation? We don't
> really need 64-bit accumulators.
How about:
					C we have 8 16-bit counts
L(e0):	vpaddl.u16 q8, q8		C we have 4 32-bit counts
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	add	r0, r0, r1
	add	r2, r2, r3
	add	r0, r0, r2
It trades one vpaddl for two add insns, but the total latency is probably
a cycle or two better since we're now operating in core registers.
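
As a sanity check on the arithmetic, the reduction can be modelled in portable C (not NEON code). The function name reduce_counts is hypothetical; the input array stands for the eight 16-bit counts held in q8 on entry to the sequence.

```c
#include <stdint.h>

/* Model of the final accumulation: c[] holds the eight 16-bit counts. */
static inline uint32_t reduce_counts(const uint16_t c[8])
{
    uint32_t w[4];

    /* vpaddl.u16 q8, q8: add adjacent 16-bit pairs into 32-bit lanes. */
    for (int i = 0; i < 4; i++)
        w[i] = (uint32_t)c[2 * i] + c[2 * i + 1];

    /* The two vmov insns move the four lanes to core registers;
       the three add insns then sum them. */
    uint32_t r0 = w[0] + w[1];
    uint32_t r2 = w[2] + w[3];
    return r0 + r2;
}
```

Even with every 16-bit count at its maximum the total is 8 * 65535, which fits comfortably in 32 bits, so no 64-bit accumulator is needed at this stage.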
> * Can one read four 128-bit values using just one insn (for inner loop)?
No. We can only read 4 64-bit values. I didn't actually realize the
assembler would accept Q registers in the <list> grammar non-terminal.
r~