arm "neon"
Niels Möller
nisse at lysator.liu.se
Sat Feb 23 15:06:10 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
> with larger unrolling to make full use of the vector load insns, and less
> over-prefetching.
Cool. This is great reading for an ARM novice like myself.
I see the loop does two 32-bit stores to rp.
> vst1.u32 {Dc0[0]}, [rp]! @ output lowest in-flight limb
[...]
> vst1.u32 {Dc0[0]}, [rp]!
I guess one could replace the first store by a move to a temporary
register, and then combine the values for a single 64-bit store
(possibly: shift when moving the first value, and another vext to
combine them).
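To make the idea concrete, here is a rough C model of the two store patterns. It is only a sketch of the memory effect, not NEON code: the function names store_two_limbs and store_limb_pair are hypothetical, and it assumes a little-endian target, so that the low half of a 64-bit value lands at the lower address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Current pattern: two separate 32-bit stores, rp advancing one limb each. */
uint32_t *store_two_limbs(uint32_t *rp, uint32_t lo, uint32_t hi)
{
    *rp++ = lo;   /* first limb out */
    *rp++ = hi;   /* second limb out */
    return rp;
}

/* Suggested pattern: hold the first limb in a temporary, combine it with
 * the second, and issue a single 64-bit store.
 * ASSUMPTION: little-endian memory, so `lo` ends up at rp[0]. */
uint32_t *store_limb_pair(uint32_t *rp, uint32_t lo, uint32_t hi)
{
    uint64_t pair = (uint64_t)lo | ((uint64_t)hi << 32);
    memcpy(rp, &pair, sizeof pair);   /* one 64-bit write */
    return rp + 2;
}
```

Both functions should leave the same two limbs at rp; whether the single store is actually faster in the real loop depends on where the bottleneck is.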
Not sure what the bottlenecks of your loop are, though: instruction
decoding, load/store, or the recurrence chain (but at least it shouldn't
be multiplier throughput, right?).
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.