Neon addmul_8

Torbjorn Granlund tg at
Tue Feb 26 19:41:24 CET 2013

Richard Henderson <rth at> writes:

  Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
  not require the addend input until the 4th cycle, producing full output on the
  5th.  This seems to be the easiest way to hide a lot of output latency.

I have now measured a few of your more surprising numbers, and I agree.

I didn't check the vmlal acc latency, but I recall having seen similarly
helpful behaviour for umlal and umaal.
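
To make the dependency structure under discussion concrete, here is a
minimal portable C sketch.  It is not Neon code and not taken from any
GMP source; the names mac_single and mac_interleaved are hypothetical.
The first loop carries every multiply-accumulate through one
accumulator, which is the pattern that benefits if the addend is only
needed late in the pipeline.  The second spreads the work over four
independent accumulators, which is the usual workaround when the
accumulator latency is not hidden.  A compiler may reorder or vectorise
either loop, so this shows the data flow only and is not a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Single dependency chain: each step needs the previous value of acc.
   If the hardware needs the addend only late, this chain can still
   issue back to back (assumption based on the measurements quoted). */
uint64_t mac_single(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (uint64_t)a[i] * b[i];
    return acc;
}

/* Four independent chains: no step depends on the step just before it,
   so accumulator latency is hidden by interleaving instead. */
uint64_t mac_interleaved(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += (uint64_t)a[i + 0] * b[i + 0];
        acc1 += (uint64_t)a[i + 1] * b[i + 1];
        acc2 += (uint64_t)a[i + 2] * b[i + 2];
        acc3 += (uint64_t)a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        acc0 += (uint64_t)a[i] * b[i];

    /* Unsigned addition wraps modulo 2^64, so the summation order
       does not change the result. */
    return acc0 + acc1 + acc2 + acc3;
}
```

Both functions return the same value for the same input.  Only the
length of the dependency chain differs, which is what a latency
measurement of this kind would be probing.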

  I'm not sure quite what's going on with the 3/4 issue rates.  I really would
  have expected to see either exactly 1, or very nearly 1/2, especially for vadd.

I think you mean 4/3.  But that is also an underestimate; with 8-way
unrolling I get a bit more, about 7/5.


There is a tradition among older LISP programmers to use names of about
that length.  Preferably using several names with minimal edit distance.

