arm "neon"

Thu Jan 17 18:36:28 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > Note that there are two *parallel* recurrency paths, one over cya
  > and one over cyb.  Pairwise adjacent umaal have a dependency, but that's
  > of the benign, non-recurrent type.

  I don't fully understand it, but at a closer look it appears that there
  *are* indeed independent umaal operations.

  E.g., the first two in the loop

  	umaal	r4, cya, u1, v0
  	.. store and reload r4 ...
  	umaal	r5, cyb, u1, v1

  If we use registers like

    d0:  v0, v1
    d1:  u1, u1
    d2:  r4, r5
    d3:  cya, cyb

  precisely the same operations could be done with neon instructions as

    vmull.u32	q3, d0, d1
    vaddl.u32	q4, d2, d3
    vadd.i64	q4, q4, q3

  Do you agree? It would be 4 cycles on a9, 3 on a15. And then there will
  be some data movements needed as well.

I agree (in principle, I didn't study your operands).
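
The claimed equivalence can be checked with a small C model.  This is
only a sketch: the function names are made up for illustration, the two
array elements stand in for the two 32-bit lanes of a d register, and
nothing here says anything about timing or the data movement needed
around the instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of umaal: (hi:lo) = a * b + lo + hi.
   The sum cannot overflow 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Two-lane model of the vmull.u32 / vaddl.u32 / vadd sequence.
   v, u, r, cy play the roles of d0, d1, d2, d3; out plays q4. */
static void neon_pair_model(uint64_t out[2],
                            const uint32_t v[2], const uint32_t u[2],
                            const uint32_t r[2], const uint32_t cy[2])
{
    for (int i = 0; i < 2; i++) {
        uint64_t prod = (uint64_t)v[i] * u[i];   /* vmull.u32 */
        uint64_t sum  = (uint64_t)r[i] + cy[i];  /* vaddl.u32 */
        out[i] = sum + prod;                     /* 64-bit vadd */
    }
}

/* Assert that each lane equals what a scalar umaal would produce. */
static void check_pair(const uint32_t v[2], const uint32_t u[2],
                       const uint32_t r[2], const uint32_t cy[2])
{
    uint64_t out[2];
    neon_pair_model(out, v, u, r, cy);
    for (int i = 0; i < 2; i++) {
        uint32_t lo = r[i], hi = cy[i];
        umaal_model(&lo, &hi, u[i], v[i]);
        assert(out[i] == (((uint64_t)hi << 32) | lo));
    }
}
```

Each 64-bit lane of the result holds the new limb in its low half and
the new carry in its high half, which is where the extra data movement
comes from.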

I see this as follows.

Imagine we have a SIMD multiply that handles k limb pairs, generating k
2-limb products.

Then we could use this in various ways.  If we have a many-to-one
addition, we could use the SIMD multiply for

   v_0*u_{i+k-1} + v_1*u_{i+k-2} + ... + v_{k-1}*u_{i+0}

to build an addmul_k.  These terms all have significance k-1 (+i).
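
As a scalar illustration of that column sum, here is a minimal C sketch.
The names column_sum and acc3 are hypothetical, 32-bit limbs are
assumed, and the loop stands in for one k-wide SIMD multiply followed by
a many-to-one addition.  Since k products of 64 bits each can exceed 64
bits, the sum is kept in three limbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-limb accumulator: w0 + w1*2^32 + w2*2^64. */
typedef struct { uint32_t w0, w1, w2; } acc3;

/* Sum the k products v[j] * u[i+k-1-j], j = 0..k-1.
   All of them have the same significance, i + k - 1. */
static acc3 column_sum(const uint32_t *u, const uint32_t *v,
                       size_t i, size_t k)
{
    assert(k > 0);
    uint64_t lo = 0;   /* low 64 bits of the running sum */
    uint32_t hi = 0;   /* number of carries out of those 64 bits */
    for (size_t j = 0; j < k; j++) {
        uint64_t p = (uint64_t)v[j] * u[i + k - 1 - j];
        lo += p;
        hi += (lo < p);   /* wrapped around, so record a carry */
    }
    acc3 r = { (uint32_t)lo, (uint32_t)(lo >> 32), hi };
    return r;
}
```

An addmul_k built this way would add w0 into the result limb of
significance i + k - 1 and pass w1 and w2 on to the next two columns.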

An alternative is to do

   v_0*u_{i+k-1} + v_0*u_{i+k-2} + ... + v_0*u_{i+0}
   v_1*u_{i+k-1} + v_1*u_{i+k-2} + ... + v_1*u_{i+0}
   ...

where a SIMD multiply is used for each line.  After each line has been
accumulated, the least significant limb position holds a finished sum of
products.  By "shifting down" one limb after each multiply, the
remaining partial sums are aligned to be added to the next line.
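
One possible reading of this scheme, as a scalar C sketch.  Everything
here is an assumption made for illustration: the name mul_shift_down,
the MAX_LANES bound, and above all the use of 16-bit limbs with 64-bit
lanes, chosen so that the lane accumulators cannot overflow and the
carry handling stays trivial.  A real implementation with full-size
limbs would need to deal with carries out of the lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 8   /* hypothetical upper bound on k */

/* r[0..n+k-1] = u[0..k-1] * v[0..n-1], using 16-bit limbs. */
static void mul_shift_down(uint16_t *r, const uint16_t *u, size_t k,
                           const uint16_t *v, size_t n)
{
    assert(k >= 1 && k <= MAX_LANES);
    uint64_t lane[MAX_LANES] = {0};
    uint64_t carry = 0;

    for (size_t j = 0; j < n; j++) {
        /* One "SIMD multiply": v[j] times all k limbs of u. */
        for (size_t t = 0; t < k; t++)
            lane[t] += (uint64_t)v[j] * u[t];

        /* Lane 0 is now a finished column sum of significance j. */
        carry += lane[0];
        r[j] = (uint16_t)carry;
        carry >>= 16;

        /* Shift down one limb so the rest lines up with the next line. */
        for (size_t t = 0; t + 1 < k; t++)
            lane[t] = lane[t + 1];
        lane[k - 1] = 0;
    }

    /* Flush the remaining partial sums and the final carry. */
    for (size_t t = 0; t < k; t++) {
        carry += lane[t];
        r[n + t] = (uint16_t)carry;
        carry >>= 16;
    }
}
```

The point of the sketch is only the data flow: no many-to-one addition
is needed, just a lane-wise accumulate and a one-limb shift per line.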

-- 
Torbjörn