arm "neon"
Niels Möller
nisse at lysator.liu.se
Thu Feb 21 10:11:53 CET 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> IIRC, there is an almost parallel set of SIMD multiply-accumulate
> insns.
Found the vmlal instruction now. Makes for a cute loop,
.Loop:
	vld1.32		l01[1], [vp]!
	vld1.32		{u00[]}, [up]!
	vaddl.u32	q1, l01, c01
	vmlal.u32	q1, u00, v01	C q1 overlaps with c01 and l01
	subs		n, #1
	vst1.32		l01[0], [rp]!
	vshr.u64	l01, l01, #32
	bne		.Loop
but still very slow, 18 cycles / iteration, or 9 cycles per limb (c/l).
> One might need to use a bigger building block, say addmul_4, in order to
> deal with accumulation latency.
Maybe. That's going to be a larger project. How do you usually organize
addmul_4? Do you have an iteration that multiplies all four v limbs by
the same ulimb, or something more fancy?
SIMD carry propagation gets ugly if carry propagates beyond two limbs,
i.e., if we need any primitives larger than umaal (the combination
vaddl.u32; vmlal.u32 above is essentially two parallel umaal).
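
For reference, a minimal C sketch of what "umaal" computes and why the
vaddl.u32; vmlal.u32 pair behaves like two of them side by side. The
function names umaal_model and umaal_x2_model are hypothetical and only
model the arithmetic, not the register allocation or timing of the loop
above. The point is that a*b + c + d with 32-bit inputs always fits in
64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so no carry escapes
beyond two limbs.

```c
#include <stdint.h>

/* Hypothetical model of the scalar umaal primitive: a*b + c + d,
   returned as a 64-bit (two-limb) result.  Cannot overflow. */
static uint64_t umaal_model(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint64_t)a * b + c + d;
}

/* Hypothetical two-lane model of vaddl.u32 followed by vmlal.u32:
   each 64-bit lane gets l[i] + c[i] + u[i]*v[i], independently. */
static void umaal_x2_model(uint64_t acc[2],
                           const uint32_t l[2], const uint32_t c[2],
                           const uint32_t u[2], const uint32_t v[2])
{
    for (int i = 0; i < 2; i++) {
        acc[i] = (uint64_t)l[i] + c[i];     /* vaddl.u32: widening add */
        acc[i] += (uint64_t)u[i] * v[i];    /* vmlal.u32: widening mul-acc */
    }
}
```

Anything that adds a third 32-bit term per lane no longer has this
guarantee, which is where the carry handling would start to get ugly.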
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.