arm "neon"

Thu Feb 21 10:11:53 CET 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> IIRC, there is an almost parallel set of SIMD multiply-accumulate
> insns.

Found the vmlal instruction now. Makes for a cute loop,

.Loop:
        vld1.32         l01[1], [vp]!
        vld1.32         {u00[]}, [up]!
        vaddl.u32       q1, l01, c01
        vmlal.u32       q1, u00, v01  C q1 overlaps with c01 and l01
        subs            n, #1
        vst1.32         l01[0], [rp]!
        vshr.u64        l01, l01, #32
        bne             .Loop

but still very slow, 18 cycles / iteration, or 9 c/l.

> One might need to use a bigger building block, say addmul_4, in order to
> deal with accumulation latency.

Maybe. That's going to be a larger project. How do you usually organize
addmul_4? Do you have an iteration that multiplies all four v limbs by
the same u limb, or something more fancy?

SIMD carry propagation gets ugly if carry propagates beyond two limbs,
i.e., if we need any primitives larger than umaal (the combination
vaddl.u32; vmlal.u32 above is essentially two parallel umaal).
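
For reference, a scalar C sketch of what I mean by umaal as a primitive:
a 32x32 multiply plus two 32-bit addends, which always fits in 64 bits
since (2^32-1)^2 + 2 (2^32-1) = 2^64 - 1. The names umaal_model,
umaal_self_check and addmul_1_model are made up for this illustration,
and this is only a model of the arithmetic, not the NEON loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of umaal: a*b + c + d as a 64-bit value. The sum cannot
   overflow, because (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static uint64_t umaal_model(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint64_t)a * b + c + d;
}

/* Worst case: all inputs at their maximum exactly fill 64 bits. */
static void umaal_self_check(void)
{
    uint32_t m = UINT32_MAX;
    assert(umaal_model(m, m, m, m) == UINT64_MAX);
}

/* Hypothetical addmul_1 built on the model: {rp,n} += {up,n} * v,
   returning the carry-out limb. One umaal per limb: the product,
   the old result limb and the incoming carry are the three terms. */
static uint32_t addmul_1_model(uint32_t *rp, const uint32_t *up,
                               size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = umaal_model(up[i], v, rp[i], carry);
        rp[i] = (uint32_t)t;          /* low half is the new result limb */
        carry = (uint32_t)(t >> 32);  /* high half carries to the next limb */
    }
    return carry;
}
```

The carry never needs more than one limb here, which is what makes
umaal convenient. Anything that adds a third addend loses that bound.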

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.