arm "neon"

Fri Feb 22 17:45:08 CET 2013

On 02/22/2013 02:32 AM, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
> 
>   Indeed, the last version that Niels posted doesn't pass this test.
>   
> Oops.
> 
>   The following does pass, but if I'm to believe the arithmetic it's
>   still fairly slow -- around 12cyc/sec.
>   
> 12cyc/sec is a poor clock frequency.  :-)

Heh.

> Perhaps addmul_2 might not be easy to make fast for this target.
> 
> I think an mul_basecase could be made to run at awesome speed.  We might
> need a building block of at least addmul_4, more likely something
> larger.

Perhaps.  At least there are 32 registers to play around with.

(FWIW, I made a mistake in register choices in the version that I
posted.  Only d8-d15 are call-saved; d0-d7 and d16-d31 are
call-clobbered.)

> Neon has SIMD 32+32 -> 64 bit add.  Assume we want to do (32+32)+32 or
> ((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
> there good ISA support for that too?  It might require an insn that does
> 32+64 -> 64.

The widening add insns are:
  VADDL  32+32->64     Qd[n] = Dn[n] + Dm[n]
  VADDW  64+32->64     Qd[n] = Qn[n] + Dm[n]
  VPADDL 32+32->64     Qd[n] = Qm[2n] + Qm[2n+1]    ("horizontal add")
  VPADAL 32+32+64->64  Qd[n] += Qm[2n] + Qm[2n+1]
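
To make the lane arithmetic concrete, here is a rough scalar model in
plain C.  It is only a sketch: the function names are made up for
illustration (they are not NEON intrinsics), arrays stand in for
registers with the array index as the lane number, and the pairwise
insns are modelled in their Q-register form with unsigned 32-bit lanes.

```c
#include <stdint.h>

/* Scalar models of the widening adds; hypothetical helper names. */

/* VADDL.U32 Qd, Dn, Dm: two 32-bit lanes widen into two 64-bit sums. */
static void vaddl_u32(uint64_t qd[2], const uint32_t dn[2], const uint32_t dm[2])
{
    for (int n = 0; n < 2; n++)
        qd[n] = (uint64_t)dn[n] + dm[n];
}

/* VADDW.U32 Qd, Qn, Dm: add widened 32-bit lanes to 64-bit lanes. */
static void vaddw_u32(uint64_t qd[2], const uint64_t qn[2], const uint32_t dm[2])
{
    for (int n = 0; n < 2; n++)
        qd[n] = qn[n] + dm[n];
}

/* VPADDL.U32 Qd, Qm: add adjacent pairs of the four 32-bit lanes. */
static void vpaddl_u32(uint64_t qd[2], const uint32_t qm[4])
{
    for (int n = 0; n < 2; n++)
        qd[n] = (uint64_t)qm[2 * n] + qm[2 * n + 1];
}

/* VPADAL.U32 Qd, Qm: as VPADDL, but accumulate into the 64-bit lanes. */
static void vpadal_u32(uint64_t qd[2], const uint32_t qm[4])
{
    for (int n = 0; n < 2; n++)
        qd[n] += (uint64_t)qm[2 * n] + qm[2 * n + 1];
}
```

Note that none of the 64-bit results can overflow from a single add of
32-bit inputs; only repeated accumulation (VADDW, VPADAL) can wrap.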

There is a narrowing add insn which might still be interesting:

  VADDHN 64+64->32     Dd[n] = (Qn[n] + Qm[n]) >> 32

Suppose you're looking to do a final sum in a vector and will
subsequently be shifting the data for addition into the next column.
You have two choices:

	vadd.i64	Qc0, Qa0, Qb0
	vsra.u64	Qc1, Qc0, #32
or
	vaddhn.i64	Dtmp, Qa0, Qb0
	vaddw.u32	Qd1, Qc1, Dtmp

Such a re-ordering might make data available for input earlier.  It
might also let you store data away in a single D register rather than
keeping the full Q register around, easing register pressure.
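
As a sanity check that the two sequences feed the same value into the
next column, here is a one-lane scalar model in plain C.  Again only a
sketch: the function names are invented for illustration, the shift is
assumed to be unsigned, and each function models a single 64-bit lane.

```c
#include <stdint.h>

/* Choice 1: vadd.i64 followed by a shift-right-accumulate by 32.
   The full 64-bit sum stays live in *c0 (a whole Q register). */
static uint64_t next_column_vadd_vsra(uint64_t a0, uint64_t b0,
                                      uint64_t c1, uint64_t *c0)
{
    *c0 = a0 + b0;              /* wraps modulo 2^64, like vadd.i64 */
    return c1 + (*c0 >> 32);    /* high half added into next column */
}

/* Choice 2: vaddhn.i64 followed by vaddw.u32.
   Only the 32-bit high half survives (a single D register). */
static uint64_t next_column_vaddhn_vaddw(uint64_t a0, uint64_t b0,
                                         uint64_t c1)
{
    uint32_t tmp = (uint32_t)((a0 + b0) >> 32);  /* narrowing high half */
    return c1 + tmp;                             /* widening add */
}
```

The difference is what remains afterwards: the first form still has
the low 32 bits of the sum in c0, while the second form discards them,
so it only fits when the low half is not needed (or is produced some
other way).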

r~