arm "neon"

Wed Feb 20 23:05:12 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Unfortunately not. speed -C -s ... mpn_addmul_2 reported around 14
  cycles, so it's 7 c/l, compared to 2.38 for the current non-simd code.
  If I interpret speed output correctly.

Include addmul_1.3 in the measurements as a sanity check.

  > What about SIMD multiply-accumulate?  IIRC, these insns have the same
  > latency and throughput as non-accumulating SIMD multiplies.

  Should look into that (I didn't notice any useful integer
  multiply-accumulate instructions on my first reading of the manual). But
  I suspect you get them on the critical path, and then the relevant
  comparison is to add latency, not mul latency.

IIRC, there is an almost parallel set of SIMD multiply-accumulate insns.
One might need to use a bigger building block, say addmul_4, in order to
deal with accumulation latency.
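
To illustrate the building-block idea, here is a rough scalar C sketch, not
NEON code and not GMP's actual implementation.  It assumes 32-bit limbs, and
the name sketch_addmul_4 and its calling convention are made up for this
example.  Each limb of up is multiplied by four limbs of vp, so every
iteration carries four products that do not depend on each other, each with
its own pending carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs */

/* Hypothetical helper, not GMP's mpn_addmul_4.
   Adds {up,n} * {vp,4} to {rp,n+4} and returns the carry out of rp[n+3]. */
limb_t sketch_addmul_4(limb_t *rp, const limb_t *up, size_t n,
                       const limb_t *vp)
{
    limb_t c[4] = {0, 0, 0, 0};   /* one pending carry per chain */
    uint64_t t;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* Within one step, chain j touches only rp[i+j] and c[j],
           so the four multiply-accumulates are independent. */
        for (int j = 0; j < 4; j++) {
            t = (uint64_t)up[i] * vp[j] + rp[i + j] + c[j];
            rp[i + j] = (limb_t)t;
            c[j] = (limb_t)(t >> 32);
        }
    }

    /* Fold the pending carries: c[j] has weight B^(n+j), B = 2^32. */
    t = 0;
    for (int j = 0; j < 4; j++) {
        t += (uint64_t)rp[n + j] + c[j];
        rp[n + j] = (limb_t)t;
        t >>= 32;
    }
    return (limb_t)t;
}
```

In a SIMD version the four independent chains would presumably map onto
separate accumulator registers, which is what gives the accumulation latency
room to hide.  Whether four chains are enough on A9 or A15 would have to be
measured.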

I did measure SIMD multiply(-accumulate) throughput some months ago and
concluded it was great, at least for A15 but I think it was great also
for A9.  I did not measure the other needed insns separately or in a mix
with multiply insns.  It might be the case that SIMD add and SIMD
multiply compete for decoding slots or issue slots.  Not an uncommon
design tradeoff.  That would make using multiply-accumulate all the more
important.

-- 
Torbjörn