arm "neon"

Fri Feb 22 11:32:44 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  Indeed, the last version that Niels posted doesn't pass this test.

Oops.

  The following does pass, but if I'm to believe the arithmetic it's
  still fairly slow -- around 12cyc/sec.

12cyc/sec is a poor clock frequency.  :-)

  If one is even more clever than I, one could do a 4x unroll, making
  best use of vld4.  But when you do that, getting the carries right
  becomes even more tricky.  But I think any correct solution will
  involve chains of vsra to shift and add up the chain.
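
To make the "chain of vsra" idea concrete, here is a minimal scalar sketch in portable C, not NEON code and not Richard's routine. It assumes the column sums already sit in 64-bit lanes and that each is at most 2^64 - 2^32, so adding a 32-bit carry cannot overflow. The names sra_lane and resolve_columns are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of a shift-right-and-accumulate step:
   returns acc + (v >> shift).  Hypothetical helper, not a NEON intrinsic. */
uint64_t sra_lane(uint64_t acc, uint64_t v, unsigned shift)
{
    assert(shift >= 1 && shift < 64);
    return acc + (v >> shift);
}

/* Turn n 64-bit column sums into n 32-bit limbs by pushing the high half
   of each column into the next one.  Assumes n >= 1 and every
   cols[i] <= 2^64 - 2^32, so the accumulate cannot overflow.
   Returns the carry out of the top column. */
uint64_t resolve_columns(uint32_t *limbs, const uint64_t *cols, size_t n)
{
    assert(n >= 1);
    uint64_t running = cols[0];
    for (size_t i = 0; i < n; i++) {
        limbs[i] = (uint32_t)running;               /* low 32 bits are final */
        uint64_t next = (i + 1 < n) ? cols[i + 1] : 0;
        running = sra_lane(next, running, 32);      /* next += running >> 32 */
    }
    return running;
}
```

Each step depends on the previous one, which is the serial dependency that makes the unrolled carry handling tricky; the sketch shows only the arithmetic, not a good instruction schedule.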

Perhaps addmul_2 might not be easy to make fast for this target.

I think a mul_basecase could be made to run at awesome speed.  We might
need a building block of at least addmul_4, more likely something
larger.

Neon has SIMD 32+32 -> 64 bit add.  If we want to do (32+32)+32 or
((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
there good ISA support for that too?  It might require an insn that does
32+64 -> 64.
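
The two arrangements can be written out per lane in portable C. This is a scalar sketch of the arithmetic only, with hypothetical helper names add_long and add_wide. As far as I know, the NEON instructions with these shapes are VADDL (32+32 -> 64) and VADDW (64+32 -> 64), but the code below uses no intrinsics and says nothing about throughput.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of a widening add: 32 + 32 -> 64. */
uint64_t add_long(uint32_t a, uint32_t b)
{
    return (uint64_t)a + b;
}

/* Scalar model of one lane of a wide add: 64 + 32 -> 64. */
uint64_t add_wide(uint64_t acc, uint32_t b)
{
    return acc + b;
}

/* Chained form ((32+32)+32)+...: one widening add, then wide adds.
   Assumes n >= 2. */
uint64_t sum_chain(const uint32_t *x, size_t n)
{
    assert(n >= 2);
    uint64_t acc = add_long(x[0], x[1]);
    for (size_t i = 2; i < n; i++)
        acc = add_wide(acc, x[i]);
    return acc;
}

/* Tree form (32+32)+(32+32): two widening adds, then a 64-bit add. */
uint64_t sum4_tree(const uint32_t x[4])
{
    return add_long(x[0], x[1]) + add_long(x[2], x[3]);
}
```

Both forms give the same 64-bit result for four terms; the chain needs the 32+64 -> 64 step, while the tree needs only widening adds plus a plain 64-bit add.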

The key here is accumulation support, as always with SIMD.

Without such ISA support, we will probably need more right shift
operations, which will damage performance.

-- 
Torbjörn