Neon addmul_8

Tue Feb 26 15:15:52 CET 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> A9 is a quite uninteresting core for Neon.  Please make your
> experiments on an A15 instead.

Maybe later, but for now, A9 is my target platform. But it seems you're
right that Neon is almost useless there.

> Make every instruction write to a unique register. 
>
> Make every instruction read the same register, which is never written.

Hmm, I tried changing all output registers to unique registers (each
written only once in the loop and never read, except that vmlal reads
its output register before accumulating into it). Do you mean that I
need to change the *input* registers of all instructions too?

For simplicity, I did this to my addmul_4 loop. Cycle time dropped from
4.5 c/l to 4. I.e., 16 cycles are needed to execute a loop consisting of
11 almost completely independent instructions:

 2 vmlal
 2 vext
 2 vpaddl
 2 ld1 (scalar)
 1 st1 (scalar)
 1 subs
 1 bne

Then it's clear that some execution unit is not able to keep up with
instruction decoding at all, right? I then tried taking out some
instructions (but keeping load, store and looping):

  no arithmetic: 1.75 c/l (only load, store, loop overhead)
  vmlal only:    2.75 c/l (also the same with vmull instead of vmlal)
  vext only:     2.5 c/l
  vext+vmlal:    3.5 c/l
  vpaddl only:   2.0 c/l
  vpaddl+vext:   3.0 c/l
  vpaddl+vmlal:  3.0 c/l
  all:           4.0 c/l

What conclusions can one draw from this exercise? It seems that vext and
vmlal compete for execution resources, while vpaddl can be done mostly
in parallel with the other operations.
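
One rough way to look at the table is to ask how far each combination is
from a purely additive model, where every instruction type costs the same
extra c/l in a mix as it does alone on top of the load/store/loop
baseline. The sketch below does only that arithmetic on the numbers
above. The names (measured, solo_cost, additive_prediction, report) are
made up for illustration, it assumes the "vpadd+vmlal" row refers to
vpaddl, and the additive model itself is an assumption, not a description
of the A9 pipeline.

```python
# Measured cycles per limb (c/l), copied from the table above.
measured = {
    frozenset(): 1.75,                        # load, store, loop overhead only
    frozenset({"vmlal"}): 2.75,
    frozenset({"vext"}): 2.5,
    frozenset({"vpaddl"}): 2.0,
    frozenset({"vext", "vmlal"}): 3.5,
    frozenset({"vpaddl", "vext"}): 3.0,
    frozenset({"vpaddl", "vmlal"}): 3.0,      # assumes "vpadd+vmlal" means vpaddl
    frozenset({"vmlal", "vext", "vpaddl"}): 4.0,
}

base = measured[frozenset()]

# Extra c/l each instruction type costs when it is the only arithmetic present.
solo_cost = {op: measured[frozenset({op})] - base
             for op in ("vmlal", "vext", "vpaddl")}


def additive_prediction(ops):
    """c/l expected if the instruction types in ops did not overlap at all."""
    return base + sum(solo_cost[op] for op in ops)


def report():
    """Return (name, measured, additive, difference) for each mixed row."""
    rows = []
    for ops, cycles in measured.items():
        if len(ops) < 2:
            continue
        pred = additive_prediction(ops)
        rows.append(("+".join(sorted(ops)), cycles, pred, cycles - pred))
    return rows


if __name__ == "__main__":
    for name, cycles, pred, diff in report():
        print(f"{name:20s} measured {cycles:.2f}  additive {pred:.2f}  diff {diff:+.2f}")
    # Whole loop: 4 limbs per iteration at 4.0 c/l, 11 instructions.
    print("instructions per cycle:", 11 / (4 * 4.0))
```

On these numbers, vext+vmlal comes out exactly additive (3.5 c/l), which
fits the two serializing on the same resource. The vpaddl rows come out
additive or 0.25 c/l above it, so the cheap vpaddl-only figure does not
obviously carry over to the mixes.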

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.