Neon addmul_8

Tue Feb 26 19:41:24 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
  not require the addend input until the 4th cycle, producing full output on the
  5th.  This seems to be the easiest way to hide a lot of output latency.

I measured a few of your more surprising numbers now, and I agree.

I didn't check the vmlal accumulator latency, but I recall seeing similarly
helpful behaviour for umlal and umaal.
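
For reference, the accumulate chain in question can be modelled in portable C.
This is only a sketch of the arithmetic, not NEON code and not a timing model;
the names acc2_t, vmlal_u32_model and mla_chain are made up for illustration.
The point is that the accumulator is the only value carried from one step to
the next, so how late the addend is needed decides how closely dependent vmlal
instructions can follow each other.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit q register holding two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } acc2_t;

/* Model of one vmlal.u32: two widening 32x32->64 products, each added
   to the corresponding 64-bit accumulator lane (wrapping mod 2^64). */
static acc2_t vmlal_u32_model(acc2_t acc, const uint32_t a[2], const uint32_t b[2])
{
    for (int i = 0; i < 2; i++)
        acc.lane[i] += (uint64_t)a[i] * b[i];
    return acc;
}

/* n dependent accumulations over 2*n input words: each step consumes the
   accumulator produced by the previous one, and nothing else is carried. */
static acc2_t mla_chain(const uint32_t *a, const uint32_t *b, size_t n)
{
    acc2_t acc = {{0, 0}};
    for (size_t k = 0; k < n; k++)
        acc = vmlal_u32_model(acc, a + 2 * k, b + 2 * k);
    return acc;
}
```

The products a[i]*b[i] do not depend on the accumulator, so only the final add
sits on the dependent path.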

  I'm not sure quite what's going on with the 3/4 issue rates.  I really would
  have expected to see either exactly 1, or very nearly 1/2, especially for vadd.

I think you mean 4/3.  But even that is an underestimate.  With 8-way
unrolling I get a bit more, about 7/5.

  cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar

There is a tradition among older LISP programmers to use names of about
that length.  Preferably using several names with minimal edit distance.

-- 
Torbjörn