arm "neon"

Sat Feb 23 15:06:10 CET 2013

Richard Henderson <rth at twiddle.net> writes:

> Down to 5.8 cyc/limb.  Good, but not fantastic.  I'm gonna try one more time
> with larger unrolling to make full use of the vector load insns, and less
> over-prefetching.

Cool. This is great reading for an ARM novice like myself.

I see the loop does two 32-bit stores to rp.

> 	vst1.u32	{Dc0[0]}, [rp]!		@ output lowest in-flight limb
[...]
> 	vst1.u32	{Dc0[0]}, [rp]!

I guess one could replace the first store by a move to a temporary
register, and then combine the values for a single 64-bit store
(possibly: shift when moving the first value, and another vext to
combine them).
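
To make the lane shuffling concrete, here is a rough model in portable C rather than real NEON code. It assumes little-endian memory order and treats a D register as a uint64_t with lane 0 in the low half. All names (dreg, shift_lane0_up, vext32_1, store_two_32, store_one_64, check_equivalence) are made up for illustration; the instructions I have in mind are something like vshl.u64 by 32, then vext.32 with #1, then a single 64-bit vst1. It only checks that the bytes written to rp come out the same, and says nothing about whether it is actually faster.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 64-bit D register: lane 0 = low 32 bits,
   lane 1 = high 32 bits. */
typedef uint64_t dreg;

/* Models something like "vshl.u64 Dd, Dm, #32": lane 0 moves up to
   lane 1, and lane 0 becomes zero. */
static inline dreg shift_lane0_up(dreg d) { return d << 32; }

/* Models "vext.32 Dd, Dn, Dm, #1": result is { Dn[1], Dm[0] }. */
static inline dreg vext32_1(dreg dn, dreg dm) { return (dn >> 32) | (dm << 32); }

/* Current scheme: two separate 32-bit stores of lane 0. */
static inline void store_two_32(uint32_t *rp, dreg first, dreg second)
{
  rp[0] = (uint32_t) first;
  rp[1] = (uint32_t) second;
}

/* Suggested scheme: park the first limb in a temporary (shifted into
   lane 1), merge in the second limb, and do one 64-bit store. */
static inline void store_one_64(uint32_t *rp, dreg first, dreg second)
{
  dreg tmp = shift_lane0_up(first);   /* done where the first store was */
  dreg pair = vext32_1(tmp, second);  /* { first[0], second[0] } */
  memcpy(rp, &pair, sizeof pair);     /* assumes little-endian layout */
}

/* Self-check: both schemes must leave identical bytes at rp. */
static inline void check_equivalence(dreg first, dreg second)
{
  uint32_t a[2], b[2];
  store_two_32(a, first, second);
  store_one_64(b, first, second);
  assert(memcmp(a, b, sizeof a) == 0);
}
```

The upper lanes of both inputs are ignored in the result, which matches the original stores only ever writing Dc0[0].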

Not sure what the bottlenecks of your loop are though; instruction
decoding, load/store, or the recurrence chain (but at least it shouldn't
be multiplier throughput, right?).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.