arm "neon"

Sat Feb 23 14:31:36 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  Down to 5.8 cyc/limb.  Good, but not fantastic.  I'm gonna try one more time
  with larger unrolling to make full use of the vector load insns, and less
  over-prefetching.

Good improvement!

Keep in mind that addmul_ will almost always be used for smallish
counts.  Prefetching will not help for typical use.

The reason for that is that we use Karatsuba's algorithm for counts
over about 20.

Large counts do occur, so if we can prefetch without slowing down the
common case, then we should.
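
To illustrate why the count stays small, here is a minimal sketch of a
schoolbook multiply built on an addmul_1-style primitive.  It assumes
32-bit limbs; the names ref_addmul_1 and mul_basecase_sketch are made up
for this example and are not the actual GMP routines.  The count passed
to the inner primitive is the operand size n, and operands much above
the Karatsuba threshold never reach this code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference primitive: {rp,n} += {up,n} * v, returns carry-out. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n,
                             uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical schoolbook multiply: {rp,2n} = {up,n} * {vp,n}.
   Every call to the primitive has count n. */
void mul_basecase_sketch(uint32_t *rp, const uint32_t *up,
                         const uint32_t *vp, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        rp[i] = 0;
    for (size_t j = 0; j < n; j++)
        rp[n + j] = ref_addmul_1(rp + j, up, n, vp[j]);
}
```

With n below about 20, each inner loop touches at most a few dozen
limbs, which is why prefetching has little to work with.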

  I guess the target is anything under 2.5 cyc/limb, against the armv6 integer
  version?

For A9 we need to beat 2.38 c/l, for A15 it is 2.5 c/l.

As stated in a FIXME, 2.25 c/l is possible for A9.

I never optimised the current code for A15, and suspect it could be
improved to 2.25 or 2.0 c/l.

I'm not trying to discourage the Neon project, but we should know the
target to aim for...

Some comments about the addmul approaches:

The name of the game for addmul_1 or addmul_2 is keeping the recurrency
path shallow.  When using mulacc insns, we should add in the rp[] data
there, since that puts the mulacc off of the recurrency path.
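
A rough C sketch of that idea, assuming 32-bit limbs (the function name
addmul_1_sketch is hypothetical, and a compiler is of course free to
schedule this differently).  The multiply-accumulate takes rp[i] as its
addend, so it depends only on loaded data; the only operation left on
the loop-carried chain is the addition of the carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: {rp,n} += {up,n} * v, returns the carry-out limb. */
uint32_t addmul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                         uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Off the recurrence path: uses only up[i], v and rp[i].
           Maximum value is (2^32-1)^2 + (2^32-1) = 2^64 - 2^32. */
        uint64_t t = (uint64_t)up[i] * v + rp[i];

        /* On the recurrence path: a single add of the previous carry.
           The sum is at most 2^64 - 1, so it cannot overflow. */
        t += carry;

        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}
```

Had the carry been fed into the multiply-accumulate instead of rp[i],
the multiply latency would sit on the chain from one limb to the next.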

For addmul_(k) with large enough k, every k-th (or perhaps every
(k/2)-th with 2-way SIMD) mulacc insn will be dependent.  So here, mulacc
will be on a recurrency path, but there will be k (or k/2) parallel
recurrency paths.

I believe the Neon approach will only be a real win for A15, where we
should be able to approach 0.7 c/l.  Not for addmul_2, but for
addmul_(k), for some non-crazy k.

Interestingly enough, I suspect such code might also be reusable for
A5x, since they have poor 64 x 64 -> 128 integer multiply support.  If
my information about the high-end A57 is correct, one such product can
be formed every 7th cycle.  If the Neon units are the same as in A15,
Neon code would be twice as fast.

-- 
Torbjörn