arm "neon"

Torbjorn Granlund tg at
Sat Feb 23 22:01:41 CET 2013

Richard Henderson <rth at> writes:

  > The reason for that is that we use Karatsuba's algorithm for counts of
  > over about 20.
  Good to know.  I won't base my tuning on tests/devel/addmul_N's
  default of ~600 limbs then.
Using a larger count (but still < L1 size) might still be useful for
measuring loop performance.  With smaller counts, call overhead and loop
overhead will blur the measurements.

  > For addmul_(k) for large enough k every k (or perhaps every k/2 with
  > 2-way SIMD) mulacc insn will be dependent.  So here, mulacc will be on a
  > recurrency path, but there will be k (or k/2) parallel recurrency paths.
  > I believe the Neon approach will only be a real win for A15, where we
  > should be able to approach 0.7 c/l.  Not for addmul_2, but for
  > addmul_(k), for some non-crazy k.
  I'm going to try again with addmul_4 instead.  Getting the same level
  of parallelism from addmul_2 requires a main loop processing 16 limbs.
  At which point, given your Karatsuba comment, we'll likely only
  traverse the loop once.
If the mul_basecase cross product speed gets close to 1 c/l, we can
expect the Karatsuba threshold to be perhaps 50 or 60.  There are a lot
of dynamics here.
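For reference, the scalar semantics that these loops implement (GMP's
mpn_addmul_1; addmul_k handles k multiplier limbs per pass) can be
sketched in portable C -- this is just the specification, not the
assembly under discussion; it assumes 64-bit limbs and __int128 support:

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Sketch of mpn_addmul_1 semantics: rp[] += up[] * v over n limbs,
   returning the carry-out limb.  Each iteration's carry feeds the
   next, which is the recurrency path the SIMD variants try to split
   into k parallel chains. */
static mp_limb_t ref_addmul_1(mp_limb_t *rp, const mp_limb_t *up,
                              long n, mp_limb_t v)
{
  mp_limb_t cy = 0;
  for (long i = 0; i < n; i++) {
    unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + cy;
    rp[i] = (mp_limb_t) t;
    cy = (mp_limb_t) (t >> 64);
  }
  return cy;
}
```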

With a clever inner loop, we can use the same pattern in feed-in and
wind-down, getting good performance for the entire thing.  (And in
mul_basecase, iterating an addmul_k building block, one may overlap the
wind-down with the feed-in, making it look very much like the inner
loop except that we break dependencies.)
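The iterated building-block structure can be sketched as follows (shown
with k = 1 for brevity; a real mul_basecase would iterate addmul_k and
overlap wind-down with feed-in as described above -- this is only the
schoolbook skeleton, assuming 64-bit limbs and __int128):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t mp_limb_t;

/* Inner building block: rp[] += up[] * v, returns carry-out limb. */
static mp_limb_t sk_addmul_1(mp_limb_t *rp, const mp_limb_t *up,
                             long n, mp_limb_t v)
{
  mp_limb_t cy = 0;
  for (long i = 0; i < n; i++) {
    unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + cy;
    rp[i] = (mp_limb_t) t;
    cy = (mp_limb_t) (t >> 64);
  }
  return cy;
}

/* Sketch of mul_basecase as an iterated addmul: one addmul pass per
   multiplier limb, each shifted one limb further into rp[], with the
   carry landing in the next free limb.  rp[] needs un + vn limbs. */
static void sk_mul_basecase(mp_limb_t *rp, const mp_limb_t *up, long un,
                            const mp_limb_t *vp, long vn)
{
  memset(rp, 0, (size_t)(un + vn) * sizeof(mp_limb_t));
  for (long j = 0; j < vn; j++)
    rp[un + j] = sk_addmul_1(rp + j, up, un, vp[j]);
}
```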

  The addmul_4 will require a minimum size=8 to enter the main 4x4 loop,
  with a secondary 4x1 loop to clean up the last 7 limbs.
  I suspect that an addmul_k, k >= 6, might need only a kx1 main loop to
  keep the pipes flowing.
I suspect such an approach (i.e., the smallest k for which no unrolling
gives great performance) might ultimately prove to be best, at least for
the critically important mul_basecase.

  > Interestingly enough, I suspect such code might also be reusable for
  > A5x, since they have poor 64 x 64 -> 128 integer multiply support.  If
  > my information about the high-end A57 is correct, one such product can
  > be formed every 7th cycle.  If the Neon units are the same as in A15,
  > Neon code would be twice as fast.
  Interesting.  I've only got access to an A15, so that's pretty much
  all I care about atm.  But these trials are on the list, if someone
  else wants to do speed testing.
I've got access to an A9 too, which has a worse Neon unit.

What I've heard about A57 is that its innards are basically an A15, with
the plain functional units extended to 64 bits, and of course a new
decoder and a larger register bank.  There is just one 32x32 -> 64
multiplier, with a throughput of one op per cycle.  A v8 mul insn will
require 3 runs through the multiply unit, and a v8 umulh insn will
require 4 runs through the multiply unit.
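The 3-vs-4 split falls out of the schoolbook decomposition of a
64x64 -> 128 product into 32x32 -> 64 pieces, which can be shown in C
(a standalone illustration, not A57 microarchitecture documentation):

```c
#include <stdint.h>

/* Decompose a 64x64 -> 128 product into four 32x32 -> 64 partial
   products.  The low 64 bits (v8 mul) need only p00, p01 and p10
   -- 3 multiplies -- while the high 64 bits (v8 umulh) also need
   p11 plus the carries out of the middle sums, i.e. all 4. */
static void mul_64x64_128(uint64_t a, uint64_t b,
                          uint64_t *lo, uint64_t *hi)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;

  uint64_t p00 = a0 * b0;                     /* 32x32 -> 64 each */
  uint64_t p01 = a0 * b1;
  uint64_t p10 = a1 * b0;
  uint64_t p11 = a1 * b1;

  uint64_t mid1 = p10 + (p00 >> 32);          /* cannot overflow */
  uint64_t mid2 = p01 + (uint32_t) mid1;      /* cannot overflow */

  *lo = (mid2 << 32) | (uint32_t) p00;
  *hi = p11 + (mid1 >> 32) + (mid2 >> 32);
}
```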

To summarise:

        32x32->64    32x32->64        64x64->128    64x64->128
          scalar       vector           scalar       vector
A9           1/2          1
A15          1/2          2
A57           1           2              1/7           2/7

We'll have to wait for A57 silicon to see if this is true.


More information about the gmp-devel mailing list