arm "neon"

Mon Jan 14 16:01:17 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle.
  > Can one stick addition here without consuming cycles?

  As I understand the manual, operations in the main cpu can be done in
  parallel with the simd instructions. But it also warns about transferring
  data between core registers and simd registers, with little detail.

  Doing carry propagation with the simd registers seems awkward. One would
  need some comparisons to get carry conditions. And to make it even
  worse, it seems the comparison instruction, vcgt, doesn't support 64-bit
  operations.

IIRC, one can do 32 + 32 -> 64 and 32 + 64 -> 64.  To extract the high
part, perhaps there are some funny add instructions (perhaps vpadd) or
one must use right shift (vshr).
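
A minimal scalar sketch in C of what one lane of these operations would
compute. The function names are made up for illustration, and the mapping
to vaddl.u32, vaddw.u32 and vshr.u64 in the comments is an assumption
about which Neon instructions would be used, not tested Neon code.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a widening add, 32 + 32 -> 64 (assumed: vaddl.u32). */
static uint64_t add_32_32(uint32_t a, uint32_t b)
{
    return (uint64_t)a + b;          /* at most 2^33 - 2, always fits */
}

/* One lane of a wide add, 64 + 32 -> 64 (assumed: vaddw.u32).
   The assert documents that the caller must rule out wraparound. */
static uint64_t add_64_32(uint64_t acc, uint32_t b)
{
    assert(acc <= UINT64_MAX - b);
    return acc + b;
}

/* Extract the high part with a right shift by 32 (assumed: vshr.u64 #32). */
static uint32_t high_part(uint64_t x)
{
    return (uint32_t)(x >> 32);
}

/* The low part is simply the low 32 bits of the lane. */
static uint32_t low_part(uint64_t x)
{
    return (uint32_t)x;
}
```

Here high_part(add_32_32(a, b)) is the carry out of a 32-bit add, which
is the quantity the comparison-based approach would otherwise have to
reconstruct.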

  One can do two parallel umlaal using

umlaal?  You probably mean umul, umaal, or umlal...

    vmull (two 32x32 -> 64), vaddl (two 32 + 32 -> 64), vadd (two 64-bit adds)

  That avoids carry propagation beyond 64 bits.
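
The claim that this stays within 64 bits can be checked with a scalar
model of one lane. This is only a sketch in C; mul_add_add is a
hypothetical name, and the comments assume the vmull/vaddl/vadd mapping
from the text.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of the umaal-like operation: u*v + a + b.
   Worst case: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so no carry
   can propagate beyond 64 bits. */
static uint64_t mul_add_add(uint32_t u, uint32_t v, uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)u * v;     /* vmull.u32: 32x32 -> 64 */
    uint64_t sum  = (uint64_t)a + b;     /* vaddl.u32: 32 + 32 -> 64 */
    assert(prod <= UINT64_MAX - sum);    /* always holds, see bound above */
    return prod + sum;                   /* vadd.u64 */
}
```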

  Is it possible to arrange an addmul_2 (or any other interesting
  function) with two *independent* umaal-like operations? If we have to
  accept that we can't do any adds in parallel, addmul_2 would need
  something like

    vmull.u32,	computing u0*v0 and u1*v0
    vaddl.u32,	lo (chain variable) + r0 (result area)
    vadd.u64,	add above to u0*v0
    vaddl.u32,	hi (chain variable) + high half of above sum
    vadd.u64,	add above sum to u1*v0

  Looks like 6 cycles (which is poor, right?), excluding any data
  movement. And recurrency latency of four adds, which shouldn't be too
  bad, I imagine.

I didn't read this carefully, but I don't see the v1 multiplies.

The parallelism of addmul_(k) for k >= 2 should allow for shallow
recurrency.  One may handle the v1 products many cycles before the v0
products.  For Neon, we should surely accumulate into 64-bit "d" types,
accumulating carry in bits 32, 33, etc.
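
A small C sketch of that idea, with hypothetical helper names: 32-bit
values are summed into a 64-bit accumulator (standing in for a "d"
register lane), the carries pile up in bits 32 and above, and they are
only split off once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum k 32-bit values into one 64-bit accumulator without resolving
   carries along the way.  Up to 2^32 terms cannot overflow 64 bits. */
static uint64_t accumulate(const uint32_t *x, size_t k)
{
    uint64_t acc = 0;
    assert((uint64_t)k <= ((uint64_t)1 << 32));
    for (size_t i = 0; i < k; i++)
        acc += x[i];                 /* carry collects in bits 32, 33, ... */
    return acc;
}

/* Split the accumulator: returns the 32-bit limb, stores the deferred
   carry (bits 32 and up) through *carry. */
static uint32_t resolve(uint64_t acc, uint32_t *carry)
{
    *carry = (uint32_t)(acc >> 32);
    return (uint32_t)acc;
}
```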

  There's also vmlal (mul and accumulate). One could shave one cycle off
  the recurrency chain by using vmlal rather than vmull, to add in r0
  earlier, and then deleting one of the low adds.

If you say so.  I'd perhaps use vmlal for accumulating the rp[] operands
as the primary approach.

  And one could possibly add in hi (high part of chain variable) with
  the same vmlal, but I'm not sure that's very useful. The challenge is
  that one still has to add the high part of the low product into the
  low part of the high product, and that's serial, not parallel. But one
  could potentially reduce the number of instructions

Having a non-zero operand in the high part wouldn't work unless we use
nails, since otherwise it would overflow.

    vmlal.u32,	compute u0*v0 + r0 and u1*v0 + hi
    vadd.u64,	add lo (chain variable) to low product
    vadd.u64,	add high half of above to high product

  That would be 4 cycles, but one also needs to somehow extend the values
  we add from 32 bits to 64, which I guess isn't for free.

You might want to take a look at the mpn/arm/v6/addmul_2.asm code in the
repo.  It avoids the long recurrency chain.

L(top): ldr     u0, [up, #4]
        umaal   r4, cya, u1, v0
        str     r4, [rp, #4]
        ldr     r4, [rp, #12]
        umaal   r5, cyb, u1, v1
        ldr     u1, [up, #8]
        umaal   r5, cya, u0, v0
        str     r5, [rp, #8]
        ldr     r5, [rp, #16]
        umaal   r4, cyb, u0, v1
        ...

Neat with just umaal and ld/st...
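
For readers who want to see the structure of the loop above without the
software pipelining, here is a C model. It is a sketch, not the GMP
routine: addmul_2_model and its calling convention (read rp[0..n-1],
write rp[0..n], return the top limb) are assumptions made for this
example. Only the umaal semantics, RdHi:RdLo = Rn*Rm + RdLo + RdHi, and
the use of two separate carry limbs cya and cyb are taken from the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of umaal lo, hi, a, b: (hi:lo) = a*b + lo + hi.
   The sum is at most 2^64 - 1, so it never overflows. */
static void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical reference: {rp,n} + {up,n} * {vp,2}.
   cya carries the v0 products, cyb carries the v1 products; the two
   chains are independent of each other. */
static uint32_t addmul_2_model(uint32_t *rp, const uint32_t *up,
                               size_t n, const uint32_t *vp)
{
    uint32_t v0 = vp[0], v1 = vp[1];
    uint32_t cya = 0, cyb = 0;

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint32_t t = rp[i];
        if (i > 0)
            umaal(&t, &cyb, up[i - 1], v1);   /* v1 product lands one limb up */
        umaal(&t, &cya, up[i], v0);
        rp[i] = t;
    }
    umaal(&cya, &cyb, up[n - 1], v1);         /* last v1 product + both carries */
    rp[n] = cya;
    return cyb;
}
```

Each result limb receives exactly two umaal operations, one per carry
chain, which is why the assembly loop needs nothing beyond umaal, ldr
and str.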

(The addmul_1 code is not good for A15, it should be rewritten to use
umul instead of umaal.)

-- 
Torbjörn