Neon addmul_8
Niels Möller
nisse at lysator.liu.se
Tue Feb 26 14:14:30 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> It might be worth a try having two copies of this main loop, one in
> which there are more than 8 limbs remaining,
I tried that, and I ended up with something *very* similar to your
addmul_8 (after first writing addmul_4 and addmul_6).
The following loop runs at 3.24 c/l on my A9 (according to the addmul_N
program):
.Loop:
vld1.32 Dtmp[0], [vp]!
C Critical path starts here
vmlal.u32 Qc01, Dv01, Du00
vmlal.u32 Qc23, Dv23, Du00
vmlal.u32 Qc45, Dv45, Du00
vmlal.u32 Qc67, Dv67, Du00
vld1.32 {Du00[]}, [up]!
subs n, #1
C Shift and add
vst1.32 Dc0[0], [rp]!
vext.32 Qc01, Qc01, Qc23, #1
vext.32 Qc23, Qc23, Qc45, #1
vext.32 Qc45, Qc45, Qc67, #1
vext.32 Qc67, Qc67, Qtmp, #1
vpaddl.u32 Qc01, Qc01
vpaddl.u32 Qc23, Qc23
vpaddl.u32 Qc45, Qc45
vpaddl.u32 Qc67, Qc67
bne .Loop
I tried moving things around to interleave independent operations, but I
only managed to slow it down. I initially used separete mul and add, but
vmlal was faster. Recurrency (for each carry register in parallel) is
vmlal, vext, vpaddl.
3.25 c/l means that one iteration in this loop takes 26 cycles, for 17
instructions. To me, that's surprisingly slow, the instruction sequence
looks very friendly with few dependencies and ample opportunities for
executing two instructions in parallel,
I also tried reversing the order of operations doing Qc67 first and Qc01
last (on the theory that this matches the natural dependencies in the
vext shifting), using additional registers to keep the values between
vext and vpaddl, but that was a slowdown.
Here are cycle numbers for my attempts:
addmul_2: 8.95 c/l
addmul_4: 4.49 c/l
addmul_6: 3.66 c/l
addmul_8: 3.24 c/l
Untried tricks: One could try to use vuzp to separate high and low
parts of the products. Then only the low parts need shifting around.
I guess I'll try that with addmul_4 first, to see if it makes for any
improvement. One could maybe use vaddw, to delay adding in one of the
carry limbs, reducing the recurrency to only vuzp, vaddw (but if the
recurrency isn't the bottleneck, that won't help).
Anyway, it seems very challenging to make this neon code competitive on
cortex-a9. I really wonder where the bottleck might be for the above
loop.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list