Neon addmul_8

Tue Feb 26 14:24:30 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  I tried that, and I ended up with something *very* similar to your
  addmul_8 (after first writing addmul_4 and addmul_6).

  The following loop runs at 3.24 c/l on my A9 (according to the addmul_N
  program):

The A9 is a rather uninteresting core for Neon.  Please run your
experiments on an A15 instead.

  I tried moving things around to interleave independent operations, but I
  only managed to slow it down. I initially used separate mul and add, but
  vmlal was faster. The recurrence (for each carry register in parallel) is
  vmlal, vext, vpaddl.

Which might be the killer.

  3.25 c/l means that one iteration in this loop takes 26 cycles, for 17
  instructions. To me, that's surprisingly slow, the instruction sequence
  looks very friendly with few dependencies and ample opportunities for
  executing two instructions in parallel,

We should probably work out the latencies for the interesting
instructions.  That's not hard to do.
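
For reference, the arithmetic behind the quoted figures can be written
out as a small Python sketch.  The issue width of 2 is only an
assumption taken from the expectation in the quoted text, not a
statement about what the core can actually sustain for Neon.

```python
# Figures quoted above for the addmul_8 loop.
limbs_per_iteration = 8    # implied by 26 cycles / 3.25 c/l
cycles_per_limb = 3.25
instructions = 17

# Assumption: two instructions issued every cycle, no stalls.
issue_width = 2

cycles_per_iteration = limbs_per_iteration * cycles_per_limb
instructions_per_cycle = instructions / cycles_per_iteration

# Bound under the dual-issue assumption only; latencies are ignored.
ideal_cycles_per_iteration = instructions / issue_width
ideal_cycles_per_limb = ideal_cycles_per_iteration / limbs_per_iteration

print(f"measured: {cycles_per_iteration:.1f} cycles/iteration, "
      f"{instructions_per_cycle:.2f} insns/cycle")
print(f"dual-issue bound: {ideal_cycles_per_iteration:.1f} cycles/iteration, "
      f"{ideal_cycles_per_limb:.2f} c/l")
```

The gap between the measured figure and this bound is what the latency
measurements would need to explain.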

  I also tried reversing the order of operations doing Qc67 first and Qc01
  last (on the theory that this matches the natural dependencies in the
  vext shifting), using additional registers to keep the values between
  vext and vpaddl, but that was a slowdown.

  Here are cycle numbers for my attempts:

  addmul_2: 8.95 c/l
  addmul_4: 4.49 c/l
  addmul_6: 3.66 c/l
  addmul_8: 3.24 c/l
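
To compare these numbers, a short Python sketch can convert them to
cycles per loop iteration and to speedup over addmul_2.  It assumes
that one iteration of addmul_N produces N limb products, which matches
the 26-cycle figure quoted for addmul_8 but is not stated for the other
variants.

```python
# c/l figures from the list above, keyed by N in addmul_N.
cycles_per_limb = {2: 8.95, 4: 4.49, 6: 3.66, 8: 3.24}

results = {}
for n, cl in cycles_per_limb.items():
    # Assumption: one loop iteration handles n limb products.
    per_iteration = n * cl
    speedup = cycles_per_limb[2] / cl
    results[n] = (per_iteration, speedup)
    print(f"addmul_{n}: {cl:.2f} c/l, "
          f"{per_iteration:.2f} cycles/iteration, "
          f"{speedup:.2f}x vs addmul_2")
```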

  Untried tricks: One could try to use vuzp to separate high and low
  parts of the products. Then only the low parts need shifting around.
  I guess I'll try that with addmul_4 first, to see if it makes for any
  improvement. One could maybe use vaddw, to delay adding in one of the
  carry limbs, reducing the recurrence to only vuzp, vaddw (but if the
  recurrence isn't the bottleneck, that won't help).

  Anyway, it seems very challenging to make this neon code competitive on
  cortex-a9. I really wonder where the bottleneck might be for the above
  loop.

You could try a dependency breaking trick I tend to use in situations
such as this:

Make every instruction write to a unique register.  (To get away with
that, you might need to save/restore the callee-saves registers.  See
Richard's previous short message about that.)

Make every instruction read the same register, which is never written.

Now, there are no dependencies at all (no RAW, no WAW, and no WAR
(anti-dependencies)).  Does the sequence run faster?  By rearranging
insns to something more balanced, does it run faster?

This is a 10-minute experiment which gives a lot of information about
the potential of the instruction mix.
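
A minimal Python sketch of the register rewriting, assuming the loop
body is available as a list of instruction templates.  The helper name,
the templates and the register pool are hypothetical; the three
instructions only stand in for the recurrence named above and are not
the actual addmul_8 loop.

```python
def break_dependencies(templates, dest_pool, src_q="q0", src_d="d0"):
    """Give every instruction a unique destination and one shared source."""
    if len(templates) > len(dest_pool):
        raise ValueError("not enough registers for unique destinations")
    if src_q in dest_pool:
        raise ValueError("the shared source register must never be written")
    return [t.format(dst=dst, src_q=src_q, src_d=src_d)
            for t, dst in zip(templates, dest_pool)]


# Hypothetical stand-in for the loop body (not the real code).
templates = [
    "vmlal.u32 {dst}, {src_d}, {src_d}",
    "vext.32 {dst}, {src_q}, {src_q}, #1",
    "vpaddl.u32 {dst}, {src_q}",
]

# q0 (which overlaps d0 and d1) is reserved as the read-only source.
dest_pool = [f"q{i}" for i in range(1, 16)]

lines = break_dependencies(templates, dest_pool)
print("\n".join(lines))
```

Note that vmlal accumulates into its destination, so it still reads the
register it writes; for that instruction the rewrite leaves a
dependency on its own result from the previous iteration.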

-- 
Torbjörn