arm "neon"

Mon Jan 14 14:51:13 CET 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> But IIUC, we are thus performing a 32 x 32 -> 64 mul per cycle.
> Can one stick addition here without consuming cycles?

As I understand the manual, operations in the main cpu can be done in
parallel with the simd instructions. But it also warns about transferring
data between core registers and simd registers, without much detail.

Doing carry propagation with the simd registers seems awkward. One would
need some comparisons to get carry conditions. And to make it even
worse, it seems the comparison instruction, vcgt, doesn't support 64-bit
operations.
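
To make the carry problem concrete, here is a minimal scalar C sketch
(plain C, not NEON; the function name is made up for illustration) of
recovering a carry-out with an unsigned comparison. This is the kind of
64-bit compare that would be needed on the simd side:

```c
#include <stdint.h>

/* Add two 64-bit values and report the carry-out (0 or 1).
   The sum wrapped around modulo 2^64 exactly when it ends up
   smaller than one of the operands, so one unsigned comparison
   is enough to recover the carry. */
static uint64_t add_with_carry_out(uint64_t a, uint64_t b, unsigned *carry)
{
    uint64_t sum = a + b;   /* wraps modulo 2^64 */
    *carry = sum < a;       /* 1 if the addition wrapped, else 0 */
    return sum;
}
```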

One can do two parallel umaal using

  vmull (two 32x32 -> 64), vaddl (two 32 + 32 -> 64), vadd (two 64-bit adds)

That avoids carry propagation beyond 64 bits.
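
As a sanity check on the "no carry beyond 64 bits" claim, here is a
scalar C model of a single lane of that three-instruction sequence. The
function name is hypothetical, and the mapping of each line to a NEON
instruction is only indicated in the comments:

```c
#include <stdint.h>

/* Scalar model of one lane: a*b + c + d with 32-bit inputs.
   Worst case: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, which still
   fits in 64 bits, so no carry-out can occur. */
static uint64_t umaal_like(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint64_t prod = (uint64_t) a * b;   /* models vmull: 32x32 -> 64 */
    uint64_t sum  = (uint64_t) c + d;   /* models vaddl: 32 + 32 -> 64 */
    return prod + sum;                  /* models vadd: 64-bit add */
}
```

With all four inputs at 0xffffffff the result is exactly 2^64 - 1, the
largest value a 64-bit lane can hold.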

Is it possible to arrange an addmul_2 (or any other interesting
function) with two *independent* umaal-like operations? If we have to
accept that we can't do any adds in parallel, addmul_2 would need
something like

  vmull.u32,	computing u0*v0 and u1*v0
  vaddl.u32,	lo (chain variable) + r0 (result area)
  vadd.u64,	add above to u0*v0
  vaddl.u32,	hi (chain variable) + high half of above sum
  vadd.u64,	add above sum to u1*v0

Looks like 6 cycles (which is poor, right?), excluding any data
movement. And a recurrence latency of four adds, which shouldn't be too
bad, I imagine.
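
The five steps above can be written out as a scalar C model of one
iteration. This is only a sketch: the struct and function names are
made up, and it assumes that the low half of the first sum is the limb
stored back to the result area and that the second sum becomes the new
chain variable, which the list above doesn't spell out.

```c
#include <stdint.h>

/* Chain variable carried between iterations, split into 32-bit halves. */
struct chain { uint32_t lo, hi; };

/* One iteration: returns the limb to store at r0's position and
   updates the chain variable in place. */
static uint32_t addmul_2_step(uint32_t u0, uint32_t u1, uint32_t v0,
                              uint32_t r0, struct chain *c)
{
    uint64_t p0 = (uint64_t) u0 * v0;       /* vmull.u32, lane 0 */
    uint64_t p1 = (uint64_t) u1 * v0;       /* vmull.u32, lane 1 */

    uint64_t s0 = (uint64_t) c->lo + r0;    /* vaddl.u32: lo + r0 */
    s0 += p0;                               /* vadd.u64: cannot overflow */

    uint64_t s1 = (uint64_t) c->hi + (uint32_t) (s0 >> 32);  /* vaddl.u32 */
    s1 += p1;                               /* vadd.u64: cannot overflow */

    c->lo = (uint32_t) s1;                  /* assumed: new chain variable */
    c->hi = (uint32_t) (s1 >> 32);
    return (uint32_t) s0;                   /* assumed: limb written to r */
}
```

The second vaddl depends on the high half of the first vadd result,
which is where the serial chain of adds comes from.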

There's also vmlal (mul and accumulate). One could shave one cycle off
the recurrence chain by using vmlal rather than vmull, to add in r0
earlier, and then deleting one of the low adds. And one could possibly
add in hi (high part of chain variable) with the same vmlal, but I'm not
sure that's very useful. The challenge is that one still has to add the
high part of the low product into the low part of the high product, and
that's serial, not parallel. But one could potentially reduce the number
of instructions:

  vmlal.u32,	compute u0*v0 + r0 and u1*v0 + hi
  vadd.u64,	add lo (chain variable) to low product
  vadd.u64,	add high half of above to high product

That would be 4 cycles, but one also needs to somehow extend the values
we add from 32 bits to 64, which I guess isn't for free.
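
For comparison, a scalar C model of this shorter sequence, again only a
sketch with made-up names. It assumes the same output convention as
before (low half of the first sum is stored, second sum is the new
chain variable). The explicit casts to uint64_t stand in for the 32 to
64 bit extension whose cost is unclear:

```c
#include <stdint.h>

/* Chain variable carried between iterations, split into 32-bit halves. */
struct chain { uint32_t lo, hi; };

/* One iteration of the vmlal-based variant. */
static uint32_t addmul_2_step_vmlal(uint32_t u0, uint32_t u1, uint32_t v0,
                                    uint32_t r0, struct chain *c)
{
    /* Accumulators must hold r0 and hi widened to 64 bits beforehand. */
    uint64_t acc0 = (uint64_t) r0;
    uint64_t acc1 = (uint64_t) c->hi;

    acc0 += (uint64_t) u0 * v0;     /* vmlal.u32, lane 0: u0*v0 + r0 */
    acc1 += (uint64_t) u1 * v0;     /* vmlal.u32, lane 1: u1*v0 + hi */

    acc0 += (uint64_t) c->lo;       /* vadd.u64: add lo, widened */
    acc1 += acc0 >> 32;             /* vadd.u64: add high half of acc0 */

    c->lo = (uint32_t) acc1;        /* assumed: new chain variable */
    c->hi = (uint32_t) (acc1 >> 32);
    return (uint32_t) acc0;         /* assumed: limb written to r */
}
```

Each accumulator is still a 32x32 product plus two 32-bit values, so
neither 64-bit add can overflow.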

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.