# arm "neon"

Richard Henderson rth at twiddle.net
Fri Feb 22 21:08:10 CET 2013

```
On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
> Useful.  Is there any 32+32 >> 32 -> 32?  I.e., carry-out.

Sadly, no.  Or if there is, I missed it.

Also interesting, as I'm looking around, is VEXT.  Consider

vmull.u32	Qa01, Du00, Dv01
vmull.u32	Qb12, Du11, Dv01

which gives us 64-bit products

a1  a0
b2  b1

considered as v4si vectors (gcc-speak for 4 x 32-bit) as

{ b2h, b2l, b1h, b1l }, { a1h, a1l, a0h, a0l }

Apply

vext.32		Qc01, Qa01, Qb12, #1

and we get Qc01 = { b1l, a1h, a1l, a0h }.  If you look at the pairs, this is
exactly the input we'd like to feed into a vpaddl.u32 to achieve the v2di
vector { b1l + a1h, a1l + a0h }.

Now, we all know that u32 * u32 + u32 + u32 cannot overflow u64 (indeed exactly
fits), so the output of that vpaddl could be used as the addend to a multiply
round with vmlal.

Which suggests a code structure like

.Loop:
vmlal.u32	Qp01, Du00, Dv01	@ v2di{ p1, p0 }
vst1.u32	{Dp0[0]}, [rp]!		@ store p0l
vext.32		Qp01, Qp01, Qzero, #1	@ v4si{ 0, p1h, p1l, p0h }
vpaddl.u32	Qp01, Qp01  		@ v2di{ p1h, p1l+p0h }
// bookkeeping
bne		.Loop

I.e. we store out 32 bits each round, keeping a "48-bit" rolling carry going
between each stage.  If this works, it's significantly less overhead than the
structure I posted yesterday.

Oh, wait, this misses the addend part of addmul.  Hmm.  We have room in the
rolling carry where I shift in zero above.  That could contain the addend
element from the appropriate round instead.

Perhaps I should give this another go...

r~
```
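The loop sketched in the post can be modelled in plain scalar C to check the carry arithmetic. This is only an illustrative sketch, not code from the thread: the function name `mul_2_rolling`, its signature, and the final two-limb flush after the loop are assumptions made here. The two `uint64_t` variables stand in for the two 64-bit lanes of `Qp01`, and the addend part of addmul is left out, just as in the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions, not from the
   post): scalar model of the rolling-carry loop.  Computes the (n+2)-limb
   product rp[0..n+1] = {up, n} * (v1:v0) using 32-bit limbs. */
void mul_2_rolling(uint32_t *rp, const uint32_t *up, size_t n,
                   uint32_t v0, uint32_t v1)
{
    uint64_t c0 = 0;  /* low carry lane:  p1l + p0h, at most 33 bits */
    uint64_t c1 = 0;  /* high carry lane: p1h, at most 32 bits */

    for (size_t i = 0; i < n; i++) {
        /* Models vmlal.u32: each lane adds u * v onto its carry.
           u32 * u32 + u32 + u32 exactly fits in 64 bits, so no overflow. */
        uint64_t p0 = (uint64_t)up[i] * v0 + c0;
        uint64_t p1 = (uint64_t)up[i] * v1 + c1;

        /* Models vst1.u32: store p0l. */
        rp[i] = (uint32_t)p0;

        /* Models vext + vpaddl: new carry is { p1h, p1l + p0h }. */
        c0 = (p1 & 0xFFFFFFFFu) + (p0 >> 32);
        c1 = p1 >> 32;
        assert(c0 <= 2 * (uint64_t)UINT32_MAX);
    }

    /* Flush the rolling carry into the top two limbs (an addition made
       for this sketch; the post's loop stops before this step). */
    rp[n] = (uint32_t)c0;
    rp[n + 1] = (uint32_t)(c1 + (c0 >> 32));
}
```

Calling it with `up = {0xFFFFFFFF, 0xFFFFFFFF}` and `v0 = v1 = 0xFFFFFFFF` exercises the worst case, where the low lane reaches exactly 2^64 - 1 without wrapping.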