I decided to play a bit with Neon, but instead of doing something hard
like addmul_k, I wrote an mpn_popcount.  :-)

The code runs well for A15 at about 0.56 c/l, but much worse on A9 at
about 2.8 c/l.  (The inner loop's hard whacking on q8 is a problem on A9;
using q8 and q9 alternately shaves off about 0.4 c/l.  Still

I am a novice at Neon hacking, so I am sure this can be improved in
various ways.

Specific questions:
* I completely ignore alignment.  Is that bad?
* Can 32 bits be read to a dN register with zeroing of the other 32
  bits?  (See comment "surely we can read...".)
* Could one shave off an instruction in the final accumulation?  We don't
  really need 64-bit accumulators.
* Can one read four 128-bit values using just one insn (for inner loop)?
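
To make the accumulator question concrete, here is a rough C sketch using
Neon intrinsics.  It is not the assembly code discussed in this mail, just
an illustration under stated assumptions: limbs are 32 bits, the name
popcount_limbs is made up, and the MAX_BLOCKS flush interval is an arbitrary
choice.  Per-byte counts are accumulated in 32-bit lanes and widened to 64
bits only once per chunk.  The loads use vld1q_u8, which only requires byte
alignment.  Without Neon it falls back to a plain scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar popcount of one 32-bit limb (standard bit-twiddling method).  */
static unsigned
popcount32 (uint32_t x)
{
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0f0f0f0fu;
  return (x * 0x01010101u) >> 24;
}

/* Hypothetical helper: count the set bits in n 32-bit limbs at p.  */
static uint64_t
popcount_limbs (const uint32_t *p, size_t n)
{
  uint64_t total = 0;
  size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  /* Each 32-bit lane grows by at most 32 per block, so 2^26 blocks
     keep every lane below 2^31.  The limit is an assumption chosen
     for simplicity, not a tuned value.  */
  const size_t MAX_BLOCKS = (size_t) 1 << 26;

  while (n - i >= 4)
    {
      size_t chunk = (n - i) / 4;
      size_t k;
      uint32x4_t acc = vdupq_n_u32 (0);
      uint64x2_t wide;

      if (chunk > MAX_BLOCKS)
        chunk = MAX_BLOCKS;

      for (k = 0; k < chunk; k++, i += 4)
        {
          /* 128-bit load with byte alignment only.  */
          uint8x16_t bytes = vld1q_u8 ((const uint8_t *) (p + i));
          uint8x16_t cnt8 = vcntq_u8 (bytes);     /* per-byte counts */
          uint16x8_t cnt16 = vpaddlq_u8 (cnt8);   /* pairwise 8 -> 16 */
          acc = vpadalq_u16 (acc, cnt16);         /* accumulate 16 -> 32 */
        }

      /* Widen to 64 bits once per chunk, not once per block.  */
      wide = vpaddlq_u32 (acc);
      total += vgetq_lane_u64 (wide, 0) + vgetq_lane_u64 (wide, 1);
    }
#endif

  /* Scalar tail, or the whole operand when Neon is unavailable.  */
  for (; i < n; i++)
    total += popcount32 (p[i]);

  return total;
}
```

Whether doing the widening outside the loop actually saves anything in the
hand-written loop depends on how the final accumulation is scheduled there;
the sketch only shows that 32-bit lanes suffice between flushes.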

