arm "neon"

Fri Feb 22 21:08:10 CET 2013

On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
> Useful.  Is there any 32+32 >> 32 -> 32?  I.e., carry-out.

Sadly, no.  Or if there is, I missed it.

Also interesting, as I'm looking around, is VEXT.  Consider

	vmull.u32	Qa01, Du00, Dv01
	vmull.u32	Qb12, Du11, Dv01

which gives us 64-bit products

	    a1  a0
	b2  b1

considered as v4si vectors (gcc-speak for 4 x 32-bit) as

	{ b2h, b2l, b1h, b1l }, { a1h, a1l, a0h, a0l }

Apply

	vext.32		Qc01, Qa01, Qb12, #1

and we get Qc01 = { b1l, a1h, a1l, a0h }.  If you look at the pairs, this is
exactly the input we'd like to feed into

	vpaddl.u32	Qc01, Qc01

to achieve the v2di vector { b1l + a1h, a1l + a0h }.
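
To make the lane shuffling concrete, here is a scalar C sketch of the
vmull / vext / vpaddl sequence.  It is only a model, not NEON code: the
struct types, the *_model function names and combine_model are hypothetical
helpers made up for illustration, and lane 0 is taken to be the least
significant element.

```c
#include <assert.h>
#include <stdint.h>

/* Lane 0 is the least significant element, as in NEON. */
typedef struct { uint32_t lane[4]; } v4si;
typedef struct { uint64_t lane[2]; } v2di;

/* vmull.u32 Qd, Dn, Dm: lane-wise 32x32 -> 64 multiply. */
v2di vmull_u32_model(uint32_t n0, uint32_t n1, uint32_t m0, uint32_t m1)
{
    v2di d = {{ (uint64_t)n0 * m0, (uint64_t)n1 * m1 }};
    return d;
}

/* View the same 128 bits as four 32-bit lanes. */
v4si as_v4si(v2di q)
{
    v4si d = {{ (uint32_t)q.lane[0], (uint32_t)(q.lane[0] >> 32),
                (uint32_t)q.lane[1], (uint32_t)(q.lane[1] >> 32) }};
    return d;
}

/* vext.32 Qd, Qn, Qm, #imm: lanes imm..3 of n, then lanes 0..imm-1 of m. */
v4si vext_32_model(v4si n, v4si m, int imm)
{
    v4si d;
    assert(imm >= 0 && imm <= 3);
    for (int i = 0; i < 4; i++)
        d.lane[i] = (i + imm < 4) ? n.lane[i + imm] : m.lane[i + imm - 4];
    return d;
}

/* vpaddl.u32 Qd, Qm: add adjacent 32-bit lanes into 64-bit lanes. */
v2di vpaddl_u32_model(v4si m)
{
    v2di d = {{ (uint64_t)m.lane[0] + m.lane[1],
                (uint64_t)m.lane[2] + m.lane[3] }};
    return d;
}

/* Hypothetical helper: the whole vmull / vmull / vext / vpaddl sequence. */
v2di combine_model(uint32_t u0, uint32_t u1, uint32_t v0, uint32_t v1)
{
    v2di a01 = vmull_u32_model(u0, u0, v0, v1);   /* { a1, a0 } */
    v2di b12 = vmull_u32_model(u1, u1, v0, v1);   /* { b2, b1 } */
    v4si c = vext_32_model(as_v4si(a01), as_v4si(b12), 1);
    return vpaddl_u32_model(c);                   /* { b1l+a1h, a1l+a0h } */
}
```

Note that each output lane is a sum of two 32-bit values, so it can need
33 bits; that is why the widening vpaddl matters here.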

Now, we all know that u32 * u32 + u32 + u32 cannot overflow u64 (indeed exactly
fits), so the output of that vpaddl could be used as the addend to a multiply
round with vmlal.
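
For the record, the bound works out as (2^32-1)^2 + 2*(2^32-1) =
2^64 - 2^33 + 1 + 2^33 - 2 = 2^64 - 1.  A small C check of the same thing,
with mul_add_add being a made-up name for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* a*b + c + d for 32-bit inputs, computed in 64 bits. */
uint64_t mul_add_add(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint64_t prod = (uint64_t)a * b;   /* at most 2^64 - 2^33 + 1 */
    uint64_t sum = prod + c;
    assert(sum >= prod);               /* no wraparound on the first add */
    assert(sum + d >= sum);            /* nor on the second */
    return sum + d;
}
```

With all four inputs at their maximum the result is exactly UINT64_MAX, so
there is no slack left for a third addend.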

Which suggests a code structure like

.Loop:
	vmlal.u32	Qp01, Du00, Dv01	@ v2di{ p1, p0 }
	vst1.u32	{Dp0[0]}, [rp]!		@ store p0l
	vext.32		Qp01, Qp01, Qzero, #1	@ v4si{ 0, p1h, p1l, p0h }
	vpaddl.u32	Qp01, Qp01  		@ v2di{ p1h, p1l+p0h }
	@ bookkeeping
	bne		.Loop

I.e. we store out 32 bits each round, keeping a "48-bit" rolling carry going
between each stage.  If this works, it's significantly less overhead than the
structure I posted yesterday.
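
A scalar C sketch of what that loop computes may help when checking the
carry handling.  Assumptions on my part: the loop is read as a mul_2-shaped
operation (one limb of u per round against the fixed pair v1:v0), the
accumulator starts at zero, and the wind-down after the loop is my own guess
at how the rolling carry would be flushed.  mul_2_model is a hypothetical
name, not an existing function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n+1] = up[0..n-1] * (v1 * 2^32 + v0), one 32-bit limb stored per round.
   p0 and p1 stand in for the two 64-bit lanes of Qp01. */
void mul_2_model(uint32_t *rp, const uint32_t *up, size_t n,
                 uint32_t v0, uint32_t v1)
{
    uint64_t p0 = 0, p1 = 0;
    for (size_t i = 0; i < n; i++) {
        /* Rolling-carry bounds that keep the vmlal from overflowing. */
        assert(p0 <= 2 * (uint64_t)UINT32_MAX);
        assert(p1 <= UINT32_MAX);
        p0 += (uint64_t)up[i] * v0;            /* vmlal.u32, low lane  */
        p1 += (uint64_t)up[i] * v1;            /* vmlal.u32, high lane */
        rp[i] = (uint32_t)p0;                  /* vst1.u32: store p0l  */
        /* vext + vpaddl: new lanes are { p1h, p1l + p0h }. */
        uint64_t next0 = (p1 & UINT32_MAX) + (p0 >> 32);
        uint64_t next1 = p1 >> 32;
        p0 = next0;
        p1 = next1;
    }
    /* Assumed wind-down: flush the remaining carry into the top two limbs. */
    rp[n] = (uint32_t)p0;
    p1 += p0 >> 32;
    rp[n + 1] = (uint32_t)p1;
}
```

The two asserts at the top of the loop are the invariant: the low lane
carries at most a sum of two 32-bit values and the high lane at most one,
so adding a 32x32 product to either stays within 64 bits.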

Oh, wait, this misses the addend part of addmul.  Hmm.  We have room in the
rolling carry where I shift in zero above.  That could contain the addend
element from the appropriate round instead.
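
One possible reading of that idea, again as a scalar C sketch rather than
NEON code.  Everything specific here is an assumption: that the "appropriate"
addend for round i is rp[i+2], that the two lanes are seeded with rp[0] and
rp[1], that rp holds n+2 limbs, and the name addmul_2_model itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n+1] += up[0..n-1] * (v1 * 2^32 + v0); returns the carry out (0 or 1).
   The addend limb rp[i+2] takes the place of the zero shifted in by vext. */
uint32_t addmul_2_model(uint32_t *rp, const uint32_t *up, size_t n,
                        uint32_t v0, uint32_t v1)
{
    uint64_t p0 = rp[0], p1 = rp[1];           /* assumed seeding */
    for (size_t i = 0; i < n; i++) {
        /* Both lanes now carry at most a sum of two 32-bit values. */
        assert(p0 <= 2 * (uint64_t)UINT32_MAX);
        assert(p1 <= 2 * (uint64_t)UINT32_MAX);
        p0 += (uint64_t)up[i] * v0;            /* vmlal.u32, low lane  */
        p1 += (uint64_t)up[i] * v1;            /* vmlal.u32, high lane */
        uint32_t addend = rp[i + 2];           /* read before rp[i] is stored */
        rp[i] = (uint32_t)p0;                  /* store p0l */
        /* vext with the addend on top, then vpaddl:
           new lanes are { addend + p1h, p1l + p0h }. */
        uint64_t next0 = (p1 & UINT32_MAX) + (p0 >> 32);
        uint64_t next1 = (p1 >> 32) + addend;
        p0 = next0;
        p1 = next1;
    }
    /* Assumed wind-down, as in the plain multiply case. */
    rp[n] = (uint32_t)p0;
    p1 += p0 >> 32;
    rp[n + 1] = (uint32_t)p1;
    return (uint32_t)(p1 >> 32);
}
```

Under these assumptions the high lane uses up exactly the headroom that the
u32 * u32 + u32 + u32 bound allows, so there is room for the addend but
nothing more.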

Perhaps I should give this another go...

r~