Neon addmul_8

Tue Feb 26 15:27:51 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Hmm, I tried changing all output registers to unique registers (only
  written once in the loop, never ever read, except that vmlal reads the
  output register before accumulating to it). Do you mean that I need to
  change the *input* registers of all instructions too?

Not if you manage to break all dependencies without doing that.  Beating
on just one input reg frees up every other register for the write-once
coding...

  For simplicity, I did this to my addmul_4 loop. Cycle time dropped from
  4.5 c/l to 4. I.e., 16 cycles needed to execute a loop consisting of 11
  almost completely independent instructions,

   2 vmlal
   2 vext
   2 vpaddl
   2 ld1 (scalar)
   1 st1 (scalar)
   1 subs
   1 bne

  Then it's clear some execution unit is not able to keep up with
  instruction decoding at all, right? I then try taking out some
  instructions (but keeping load, store and looping):

    no arithmetic: 1.75 c/l (only load, store, loop overhead)
    vmlal only:    2.75 c/l (also the same with vmull instead of vmlal)
    vext only:     2.5 c/l
    vext+vmlal:    3.5 c/l
    vpaddl only:   2.0 c/l
    vpaddl+vext:   3.0 c/l
    vpaddl+vmlal:  3.0 c/l
    all:           4.0 c/l

  What conclusions can one draw from this exercise? It seems that vext and
  vmlal compete for execution resources, while vpaddl can be done mostly
  in parallel with the other operations.

That's an important conclusion.  Perhaps one should avoid vext?  (But
since these experiments were done on A9, we shouldn't make that
conclusion at all).
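
A small C sketch of the arithmetic behind the quoted measurements may
help here.  It subtracts the bare-loop figure from each
single-instruction variant and predicts a combined figure under the
assumption that the parts do not overlap at all.  The macro and function
names are made up for illustration; only the c/l values come from the
list above.

```c
#include <stddef.h>

/* Measured A9 cycles per limb (c/l), copied from the quoted list. */
#define CL_BASE   1.75  /* load, store and loop overhead only */
#define CL_VMLAL  2.75
#define CL_VEXT   2.50
#define CL_VPADDL 2.00

/* Extra c/l that a variant costs on top of the bare loop. */
static double extra_cl(double measured)
{
    return measured - CL_BASE;
}

/* Predicted c/l for a combination if nothing overlaps:
   the bare loop plus the sum of the individual extras. */
static double serial_prediction(const double *parts, size_t n)
{
    double sum = CL_BASE;
    for (size_t i = 0; i < n; i++)
        sum += extra_cl(parts[i]);
    return sum;
}
```

With these numbers vmlal and vext alone cost 1.0 and 0.75 c/l over the
bare loop, and the no-overlap prediction for the pair is 3.5 c/l, the
same as the measured vext+vmlal figure.  vpaddl alone adds only 0.25 c/l.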

If vext is bad also for A15, I'd hope to use the 32 + 64 -> 64 add
instruction for all (k'ish) column summations, and only when a column is
about to get ready, add previous carry-in, and then shuffle as needed
with vext.  That might reduce things to one vext per iteration.
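
One possible reading of this column scheme, as a plain C model rather
than Neon code: every 64-bit product is split into 32-bit halves, the
halves are summed per column with a widening add (presumably vaddw.u32
on Neon), and the carry from the previous column is folded in only once
the column is complete.  The names addw and model_addmul are
hypothetical, the limb size is assumed to be 32 bits, and the vext
shuffling is not modelled at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a "32 + 64 -> 64" widening add on a single lane. */
static inline uint64_t addw(uint64_t acc, uint32_t x)
{
    return acc + x;
}

/* Hypothetical helper: {rp, n+k} += {up, n} * {vp, k}, 32-bit limbs.
   Returns the carry out of the top limb (0 or 1). */
static uint32_t model_addmul(uint32_t *rp, const uint32_t *up, size_t n,
                             const uint32_t *vp, size_t k)
{
    uint64_t carry = 0;  /* carry-in from the previous, finished column */
    for (size_t j = 0; j < n + k; j++) {
        uint64_t col = 0;
        /* Column j collects the low halves of products with i + l == j
           and the high halves of products with i + l == j - 1. */
        for (size_t l = 0; l < k; l++) {
            if (j >= l && j - l < n)
                col = addw(col, (uint32_t)((uint64_t)up[j - l] * vp[l]));
            if (j >= l + 1 && j - l - 1 < n)
                col = addw(col,
                           (uint32_t)(((uint64_t)up[j - l - 1] * vp[l]) >> 32));
        }
        col = addw(col, rp[j]);
        /* Column complete: only now add the carry-in and emit a limb. */
        col += carry;
        rp[j] = (uint32_t)col;
        carry = col >> 32;
    }
    return (uint32_t)carry;
}
```

Each column sums at most 2k+1 values below 2^32 plus a small carry, so
the 64-bit accumulator cannot overflow for any realistic k.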

Your vmlal performance seems strange.  I can run one vmlal every 2nd
cycle on my A9, i.e., a c/l of throughput.

-- 
Torbjörn