ARM Neon popcount
Torbjorn Granlund
tg at gmplib.org
Wed Feb 27 23:33:06 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> * I completely ignore alignment. Is that bad?
I'm not sure about that. It's something that perhaps we should
experiment with. As written, the code will work, as the chip will
handle totally unaligned data. What I don't know is whether
*specifying* increased alignment in the insn helps. E.g.
	vld1.32 { q1, q2 }, [r0@128]!
As specified in section A.3.2.1, if you specify the alignment it will
also be checked, so you'll get SIGBUS if it's not right.
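To make the requirement concrete: a 128-bit alignment qualifier means
the address in the base register must be a multiple of 16 bytes. Below
is a minimal C sketch of that check. The helper name is_aligned_bits is
made up for illustration and is not part of GMP; it only models the
address test, not what the hardware does on a mismatch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function): returns nonzero if p is
   aligned to align_bits bits.  align_bits must be a power of two and
   at least 8, e.g. 64, 128 or 256 as used in vld1 qualifiers. */
static int is_aligned_bits(const void *p, unsigned align_bits)
{
    assert(align_bits >= 8 && (align_bits & (align_bits - 1)) == 0);
    uintptr_t mask = (uintptr_t)(align_bits / 8) - 1;  /* 128 -> 0xf */
    return ((uintptr_t)p & mask) == 0;
}
```

A limb pointer that is only 4-byte aligned fails this test for 128, which
is the case where an alignment-qualified load would fault.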
I wanted to experiment, but I cannot find any syntax which is accepted
by gas. @128 does not work (in gas 2.22).
I cannot see any A15 performance difference with the current code as I
vary the alignment.
It is possible that we always pay a price when using the current
unaligned insn form.
(On A9 the vld1 insns are clearly quite expensive; by removing them the
code runs at 1.6 c/l.)
vmov.i64 d0, #0
vld1.i32 {d0[0]}, [up]!
Thanks.
> * Could one shave off an instruction in the final accumulation? We don't
> really need 64-bit accumulators.
How about:
                                C we have 8 16-bit counts
L(e0):  vpaddl.u16 q8, q8       C we have 4 32-bit counts
        vmov    r0, r1, d16
        vmov    r2, r3, d17
        add     r0, r0, r1
        add     r2, r2, r3
        add     r0, r0, r2
It trades one vpaddl for two add insns, but the total latency is
probably a cycle or two better since we're now operating in the core.
Need to test that, I think. I fear the corereg<->vreg bandwidth might
be poor.
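For checking the arithmetic of the shortened tail, here is a portable C
model of it. This is only a sketch: the function names are hypothetical,
and it models lane values, not latency or the corereg<->vreg transfer
cost, which still have to be measured on hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "vpaddl.u16 q8, q8": add adjacent 16-bit lanes, widening
   each sum to 32 bits, so 8 counts become 4 counts. */
static void vpaddl_u16_model(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* Model of the two vmov transfers followed by the three core adds. */
static uint32_t reduce_counts(const uint16_t counts[8])
{
    uint32_t q8[4];
    vpaddl_u16_model(counts, q8);

    uint32_t r0 = q8[0], r1 = q8[1];    /* vmov r0, r1, d16 */
    uint32_t r2 = q8[2], r3 = q8[3];    /* vmov r2, r3, d17 */
    r0 += r1;                           /* add r0, r0, r1 */
    r2 += r3;                           /* add r2, r2, r3 */
    return r0 + r2;                     /* add r0, r0, r2 */
}

/* Small usage example: eight lanes at the 16-bit maximum still sum
   without overflow, since 8 * 65535 fits easily in 32 bits. */
static void reduce_counts_example(void)
{
    uint16_t all_max[8] = { 65535, 65535, 65535, 65535,
                            65535, 65535, 65535, 65535 };
    assert(reduce_counts(all_max) == 8u * 65535u);
}
```

The model shows why 64-bit accumulators are unnecessary at this point:
the widened 32-bit lanes have ample headroom for the final sum.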
> * Can one read four 128-bit values using just one insn (for inner loop)?
No. We can only read 4 64-bit values. I didn't actually realize the
assembler would accept Q registers in the <list> grammar non-terminal.
It makes the code a bit more readable, since we avoid aliasing.
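As a reminder of what the Q-register list stands for, here is a small C
sketch of the register overlay and the resulting transfer size. The
function names are invented for illustration. It relies on two
architectural facts: Qn overlays D(2n) and D(2n+1), and a vld1 register
list holds at most four D registers.

```c
#include <assert.h>

/* AArch32 NEON register overlay: Qn occupies D(2n) and D(2n+1). */
static unsigned q_low_d(unsigned q)  { assert(q < 16); return 2 * q; }
static unsigned q_high_d(unsigned q) { assert(q < 16); return 2 * q + 1; }

/* Bytes moved by one vld1 whose list holds n_d D registers (1 to 4). */
static unsigned vld1_list_bytes(unsigned n_d)
{
    assert(n_d >= 1 && n_d <= 4);
    return 8 * n_d;
}
```

So the list { q1, q2 } names d2, d3, d4 and d5, and one such load moves
32 bytes, i.e. four 64-bit values or two 128-bit ones.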
--
Torbjörn