# arm "neon"

Richard Henderson rth at twiddle.net
Fri Feb 22 17:45:08 CET 2013

```
On 02/22/2013 02:32 AM, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
>   Indeed, the last version that Niels posted doesn't pass this test.
>
> Oops.
>
>   The following does pass, but if I'm to believe the arithmetic it's
>   still fairly slow -- around 12cyc/sec.
>
> 12cyc/sec is a poor clock frequency.  :-)

Heh.

> Perhaps addmul_2 might not be easy to make fast for this target.
>
> I think an mul_basecase could be made to run at awesome speed.  We might
> need a building block of at least addmul_4, more likely something
> larger.

Perhaps.  At least there's 32 registers to play around with.

(FWIW, I made a mistake in register choices in the version that I
posted. It's only d8-d15 that are call-saved.  d0-d7 and d16-d31 are
call-clobbered.)

> Neon has SIMD 32+32 -> 64 bit add.  Assume we want to do (32+32)+32 or
> ((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
> there good ISA support for that too?  It might require an insn that does
> 32+64 -> 64.

VADDL  32+32->64     Qd[n] = Dn[n] + Dm[n]
VADDW  64+32->64     Qd[n] = Qn[n] + Dm[n]
VPADAL 32+32+64->64  Qd[n] += Qm[2n] + Qm[2n+1]

There is a narrowing add insn which might still be interesting:

VADDHN 64+64->32     Dd[n] = (Qn[n] + Qm[n]) >> 32

Suppose you're looking to do a final sum in a vector and will
subsequently be shifting the data for addition into the next column.
You have two choices:

vsra.i64	Qc1, Qc0, #32
or