arm "neon"
Torbjorn Granlund
tg at gmplib.org
Mon Jan 14 18:40:48 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
Torbjorn Granlund <tg at gmplib.org> writes:
> I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very
> cooperative in that the accumulation dependency is very shallow.
Nice. Is the same true for the non-simd umaal instruction?
Things are complicated; I cannot get my head around what's going on.
This example with partial overlapping runs at 12 cycles:
umaal r2, r1, r14, r14
umaal r2, r3, r14, r14
umaal r2, r5, r14, r14
umaal r2, r7, r14, r14
This example with 100% overlapping runs at 8 cycles:
umaal r2, r1, r14, r14
umaal r2, r1, r14, r14
umaal r2, r1, r14, r14
umaal r2, r1, r14, r14
This example with partial overlapping runs at 16 cycles:
umaal r2, r1, r14, r14
umaal r4, r1, r14, r14
umaal r6, r1, r14, r14
umaal r8, r1, r14, r14
With completely independent operands, things run at 8 cycles.
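
For reference when reading the register patterns above: umaal RdLo, RdHi, Rn, Rm computes RdHi:RdLo = Rn * Rm + RdLo + RdHi, so both destination registers are also inputs. In the first sequence r2 is carried through as RdLo while RdHi changes; in the third, r1 is the shared RdHi. A minimal C sketch of the arithmetic only (the helper name umaal and the pointer-based interface are illustrative assumptions, and the model says nothing about timing):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of: umaal RdLo, RdHi, Rn, Rm
   Computes RdHi:RdLo = Rn * Rm + RdLo + RdHi.
   The sum cannot overflow 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal(uint32_t *rdlo, uint32_t *rdhi, uint32_t rn, uint32_t rm)
{
    uint64_t t = (uint64_t)rn * rm + *rdlo + *rdhi;
    *rdlo = (uint32_t)t;          /* low word back into RdLo  */
    *rdhi = (uint32_t)(t >> 32);  /* high word back into RdHi */
}
```

Each instruction therefore reads and writes both RdLo and RdHi, which is why sharing either one between consecutive instructions creates a dependency chain.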
--
Torbjörn