Neon addmul_8
Torbjorn Granlund
tg at gmplib.org
Tue Feb 26 19:41:24 CET 2013
Richard Henderson <rth at twiddle.net> writes:
Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
not require the addend input until the 4th cycle, producing full output on the
5th. This seems to be the easiest way to hide a lot of output latency.
I measured a few of your more surprising numbers now, and I agree.
I didn't check the vmlal acc latency, but I recall seeing similar
helpful behaviour for umlal and umaal.
I'm not sure quite what's going on with the 3/4 issue rates. I really would
have expected to see either exactly 1, or very nearly 1/2, especially for vadd.
I think you mean 4/3. But that is also an underestimate; with 8-way
unrolling I get a bit more, about 7/5.
cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar
There is a tradition among older LISP programmers to use names of about
that length. Preferably using several names with minimal edit distance.
--
Torbjörn