arm "neon"

Torbjorn Granlund tg at
Fri Feb 22 11:32:44 CET 2013

Richard Henderson <rth at> writes:

  Indeed, the last version that Niels posted doesn't pass this test.

  The following does pass, but if I'm to believe the arithmetic it's
  still fairly slow -- around 12cyc/sec.

12cyc/sec is a poor clock frequency.  :-)

  If one is even more clever than I, one could do a 4x unroll, making
  best use of vld4.  But when you do that, getting the carries right
  becomes even more tricky.  But I think any correct solution will
  involve chains of vsra to shift and add up the chain.

Perhaps addmul_2 might not be easy to make fast for this target.

I think a mul_basecase could be made to run at awesome speed.  We might
need a building block of at least addmul_4, more likely something

Neon has SIMD 32+32 -> 64 bit add.  Assume we want to do (32+32)+32 or
((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
there good ISA support for that too?  It might require an insn that does
32+64 -> 64.
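
To make the two groupings concrete, here is a scalar C model of a single
lane.  It is only a sketch: the function names are made up for
illustration, and it says nothing about which Neon insns would implement
each step or what they would cost.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one SIMD lane.  All names here are hypothetical. */

/* 32+32 -> 64: widen before adding, so the carry is kept. */
uint64_t add_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a + b;
}

/* 64+32 -> 64: accumulate a narrow value into a wide accumulator. */
uint64_t acc_wide(uint64_t acc, uint32_t c)
{
    return acc + c;
}

/* ((32+32)+32)+32: one widening add, then two accumulations. */
uint64_t sum4_chain(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return acc_wide(acc_wide(add_wide(a, b), c), d);
}

/* (32+32)+(32+32): two independent widening adds, then a 64-bit add. */
uint64_t sum4_tree(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return add_wide(a, b) + add_wide(c, d);
}

/* Both groupings agree, even when every input is 2^32-1. */
void sum4_selfcheck(void)
{
    uint32_t m = UINT32_MAX;
    assert(sum4_chain(m, m, m, m) == 0x3fffffffcULL);
    assert(sum4_tree(m, m, m, m) == sum4_chain(m, m, m, m));
}
```

The chain form needs the 64+32 accumulate step twice; the tree form
needs only 32+32 widening adds plus a plain 64-bit add, at the price of
a second wide temporary.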

The key here is accumulation support, as always with SIMD.

Without such ISA support, we probably need more right shift
operations, which will damage performance.
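
As a rough illustration of where the right shifts come from, the
following scalar C sketch turns 64-bit column sums back into 32-bit
limbs.  Each column costs one shift plus one add, which is the
shift-and-accumulate pattern that vsra provides per lane.  The helper
name and the bound on the column sums are assumptions made for this
example, not anything taken from GMP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GMP function.  s[0..n-1] are 64-bit column
   sums, s[i] having weight 2^(32*i).  Writes n 32-bit limbs to r and
   returns the carry out of the top column.  Assumes every s[i] is at
   most 2^64 - 2^32, so adding a carry below 2^32 cannot overflow.  */
uint32_t normalize_columns(uint32_t *r, const uint64_t *s, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = s[i] + carry;   /* accumulate the shifted-down carry */
        r[i] = (uint32_t)t;          /* low 32 bits become the limb */
        carry = t >> 32;             /* right shift feeds the next column */
    }
    return (uint32_t)carry;
}
```

The loop is a serial dependency chain: every extra level of partial sums
that has to be normalized this way adds another pass of shifts.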

