arm "neon"
Richard Henderson
rth at twiddle.net
Fri Feb 22 21:08:10 CET 2013
On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
> Useful. Is there any 32+32 >> 32 -> 32? I.e., carry-out.
Sadly, no. Or if there is, I missed it.
Also interesting, as I'm looking around, is VEXT. Consider
vmull.u32 Qa01, Du00, Dv01
vmull.u32 Qb12, Du11, Dv01
which gives us 64-bit products
a1 a0
b2 b1
considered as v4si vectors (gcc-speak for 4 x 32-bit) as
{ b2h, b2l, b1h, b1l }, { a1h, a1l, a0h, a0l }
Apply
vext.32 Qc01, Qa01, Qb12, #1
and we get Qc01 = { b1l, a1h, a1l, a0h }. Looking at the pairs, this is exactly
the input we'd like to feed into
vpaddl.u32 Qc01, Qc01
to achieve the v2di vector { b1l + a1h, a1l + a0h }.
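If it helps to see the data movement, here is a rough scalar model of that
vext/vpaddl pair in plain C (not the NEON code itself; the struct and helper
names are just mine for illustration):

    #include <assert.h>
    #include <stdint.h>

    /* Lanes stored low-to-high: e[0] is the least significant element. */
    typedef struct { uint32_t e[4]; } v4si;

    /* vext.32 Qd, Qn, Qm, #1: lanes 1..3 of Qn followed by lane 0 of Qm. */
    static v4si vext32_1(v4si n, v4si m)
    {
        v4si r = { { n.e[1], n.e[2], n.e[3], m.e[0] } };
        return r;
    }

    /* vpaddl.u32: pairwise-add adjacent 32-bit lanes into 64-bit results. */
    static void vpaddl_u32(uint64_t r[2], v4si x)
    {
        r[0] = (uint64_t) x.e[0] + x.e[1];
        r[1] = (uint64_t) x.e[2] + x.e[3];
    }

    int main(void)
    {
        uint32_t u0 = 0xdeadbeef, u1 = 0x12345678;
        uint32_t v0 = 0xffffffff, v1 = 0xfedcba98;
        uint64_t a0 = (uint64_t) u0 * v0, a1 = (uint64_t) u0 * v1;
        uint64_t b1 = (uint64_t) u1 * v0;   /* b2 = u1*v1 plays no part here */

        v4si qa01 = { { (uint32_t) a0, (uint32_t) (a0 >> 32),
                        (uint32_t) a1, (uint32_t) (a1 >> 32) } };
        v4si qb12 = { { (uint32_t) b1, (uint32_t) (b1 >> 32), 0, 0 } };

        uint64_t s[2];
        vpaddl_u32(s, vext32_1(qa01, qb12));

        assert(s[0] == (a1 & 0xffffffff) + (a0 >> 32));  /* a1l + a0h */
        assert(s[1] == (b1 & 0xffffffff) + (a1 >> 32));  /* b1l + a1h */
        return 0;
    }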
Now, we all know that u32 * u32 + u32 + u32 cannot overflow u64 (indeed exactly
fits), so the output of that vpaddl could be used as the addend to a multiply
round with vmlal.
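A quick sanity check of that bound in plain C (the maximum value,
(2^32-1)^2 + 2*(2^32-1), is exactly 2^64-1):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t m = 0xffffffffu;             /* 2^32 - 1 */
        assert(m * m + m + m == UINT64_MAX);  /* u32*u32 + u32 + u32 <= 2^64-1 */
        return 0;
    }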
Which suggests a code structure like
.Loop:
vmlal.u32 Qp01, Du00, Dv01 @ v2di{ p1, p0 }
vst1.u32 {Dp0[0]}, [rp]! @ store p0l
vext.32 Qp01, Qp01, Qzero, #1 @ v4si{ 0, p1h, p1l, p0h }
vpaddl.u32 Qp01, Qp01 @ v2di{ p1h, p1l+p0h }
// bookkeeping
bne .Loop
I.e., we store out 32 bits each round, keeping a "48-bit" rolling carry going
between stages. If this works, it's significantly less overhead than the
structure I posted yesterday.
Oh, wait, this misses the addend part of addmul. Hmm. We have room in the
rolling carry where I shift in zero above. That could contain the addend
element from the appropriate round instead.
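For reference, the scalar recurrence each round ultimately has to reproduce,
addend included, looks like this in plain C (just the usual addmul_1 shape,
not the NEON mapping; names are mine):

    #include <stddef.h>
    #include <stdint.h>

    /* rp[] += u0 * vp[], returning the final carry limb.  Each round
       stores one 32-bit limb (the vst1 of p0l above) and carries the
       remaining high part into the next round (the rolling carry). */
    uint32_t addmul_1_ref(uint32_t *rp, const uint32_t *vp,
                          size_t n, uint32_t u0)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t t = (uint64_t) u0 * vp[i] + rp[i] + carry;
            rp[i] = (uint32_t) t;
            carry = t >> 32;
        }
        return (uint32_t) carry;
    }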
Perhaps I should give this another go...
r~