arm "neon"

Mon Jan 14 17:04:51 CET 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   One can do two parallel umlaal using
>   
> umlaal?  You probably mean umul, umaal, or umlal...

I meant umaal. I'm getting confused by all these random-looking
instruction names...
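
For reference, a minimal C model of what umaal computes, as I understand
it (RdHi:RdLo = Rn * Rm + RdLo + RdHi). The function name umaal_model is
made up for illustration; this is a sketch of the arithmetic only, not of
timing or encoding.

```c
#include <stdint.h>

/* Scalar model of "umaal RdLo, RdHi, Rn, Rm":
     RdHi:RdLo = Rn * Rm + RdLo + RdHi
   The sum cannot overflow 64 bits, since
     (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.  */
static void
umaal_model (uint32_t *rdlo, uint32_t *rdhi, uint32_t rn, uint32_t rm)
{
  uint64_t t = (uint64_t) rn * rm + *rdlo + *rdhi;
  *rdlo = (uint32_t) t;          /* low 32 bits of the result */
  *rdhi = (uint32_t) (t >> 32);  /* high 32 bits of the result */
}
```

With all four inputs at their maximum the result is exactly 2^64 - 1,
which is why a product plus two addends fits without a carry out.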

>     vmull.u32,	computing u0 * v0 and u1 * v0
>     vaddl.u32,	lo (chain variable) + r0 (result area)
>     vadd.u64,	add above to u0 * v0
>     vaddl.u32,	hi (chain variable) + high half of above sum
>     vadd.u64,	add above sum to u1 * v0
>     
>   Looks like 6 cycles (which is poor, right?), excluding any data
>   movement. And recurrency latency of four adds, which shouldn't be too
>   bad, I imagine.
>   
> I didn't read carefully, and I miss v1 multiplies.

The idea was that u0, u1 is the loop-invariant operand, and the above is
for one iteration processing only a single limb from v.
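
A scalar C sketch of that one iteration may make the data flow clearer.
Each statement is annotated with the Neon instruction it is meant to
stand for; the names chain_t and addmul_step are hypothetical, and the
model says nothing about lane layout or data movement.

```c
#include <stdint.h>

/* Chain (carry) variables passed from one iteration to the next. */
typedef struct { uint32_t lo, hi; } chain_t;

/* One iteration: add u0 * v0 and u1 * v0 into the result limb r0 and
   the chain.  Returns the new value of the result limb.  */
static uint32_t
addmul_step (uint32_t u0, uint32_t u1, uint32_t v0, uint32_t r0, chain_t *c)
{
  uint64_t p0 = (uint64_t) u0 * v0;              /* vmull.u32, lane 0 */
  uint64_t p1 = (uint64_t) u1 * v0;              /* vmull.u32, lane 1 */
  uint64_t s = (uint64_t) c->lo + r0;            /* vaddl.u32 */
  p0 += s;                                       /* vadd.u64, <= 2^64 - 1 */
  uint64_t t = (uint64_t) c->hi + (uint32_t) (p0 >> 32);  /* vaddl.u32 */
  p1 += t;                                       /* vadd.u64, <= 2^64 - 1 */
  c->lo = (uint32_t) p1;                         /* chain for next limb */
  c->hi = (uint32_t) (p1 >> 32);
  return (uint32_t) p0;                          /* new result limb */
}
```

The recurrence runs from the chain variables through the four additions,
while the two multiplies depend only on the inputs.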

> The parallelism of addmul_(k) for k >= 2 should allow for shallow
> recurrency.  One may handle the v1 products many cycles before the v0
> products.  For Neon, we should surely accumulate into 64-bit "d" types,
> accumulating carry in bits 32, 33, etc.

A sum of 32-bit values can be accumulated into a 64-bit register. But if
we want to accumulate 64-bit values, i.e., limb products, it gets
tricky.
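
To make the difference concrete, here is a small C sketch, assuming
32-bit limbs. The helper names acc32 and add64_wraps are invented for
this example.

```c
#include <stdint.h>

/* Add n copies of the 32-bit value x into a 64-bit accumulator.  This
   cannot wrap for any n up to 2^32, since 2^32 * (2^32 - 1) < 2^64.  */
static uint64_t
acc32 (uint32_t x, unsigned n)
{
  uint64_t a = 0;
  for (unsigned i = 0; i < n; i++)
    a += x;
  return a;
}

/* Return 1 if the 64-bit sum a + b wraps around, else 0.  */
static int
add64_wraps (uint64_t a, uint64_t b)
{
  return (uint64_t) (a + b) < a;
}
```

Already two maximal limb products, (2^32 - 1)^2 each, wrap when added as
64-bit values, whereas one such product plus two 32-bit addends does not.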

> having a non-zero operand in the high part wouldn't work unless we use
> nails, since else it would overflow.

Agreed. I was suggesting something like

  u1 * v0  |  u0 * v0
     + c1  |     + c0
---------------------
  t3   t2     t1   t0

This can be done in parallel with a single vmlal. But next we need to
add those two 64-bit values with a 32-bit overlap,

     t1 t0
+ t3 t2
----------
  c1 c0 r0

where r0 is the result word (the old value also needs to be added in, if
it's addmul_2 rather than just mul_2), and c0, c1 are the recurrence
variables for the next iteration.
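
The scheme above, for the plain mul_2 case, can be sketched in scalar C
as follows. This only models the arithmetic (the vmlal step and the
overlapping add); the function name mul_2_model and its argument order
are assumptions for illustration and do not match any existing mpn
interface.

```c
#include <stddef.h>
#include <stdint.h>

/* r[0..n+1] = {u1, u0} * v[0..n-1], with 32-bit limbs.  */
static void
mul_2_model (uint32_t *r, const uint32_t *v, size_t n,
             uint32_t u0, uint32_t u1)
{
  uint32_t c0 = 0, c1 = 0;      /* recurrence variables */
  for (size_t i = 0; i < n; i++)
    {
      /* Models one vmlal.u32 on 64-bit lanes preloaded with c0 and c1.  */
      uint64_t t_lo = (uint64_t) u0 * v[i] + c0;   /* t1:t0 */
      uint64_t t_hi = (uint64_t) u1 * v[i] + c1;   /* t3:t2 */

      /* Overlapping add.  t_hi <= 2^64 - 2^32 and t1 <= 2^32 - 1,
         so the sum fits in 64 bits.  */
      uint64_t s = t_hi + (t_lo >> 32);

      r[i] = (uint32_t) t_lo;                      /* r0 */
      c0 = (uint32_t) s;
      c1 = (uint32_t) (s >> 32);
    }
  r[n] = c0;
  r[n + 1] = c1;
}
```

For addmul_2, the old r[i] would be added into the low lane together
with c0, which still fits since u0 * v[i] + c0 + r[i] <= 2^64 - 1.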

Maybe collecting the two products that involve a single v limb is a poor
way to think about addmul_2. I'm not really familiar with how the
current assembly loops are organized (if I ever looked into it, I'm
afraid I've forgotten...).

> You might want to take a look at the repo mpn/arm/v6/addmul_2.asm code.
> It avoids the long recurrency chain.

[...]

> Neat with just umaal and ld/st...

Definitely neat. I had a quick look, but I'll need a bit more time to
digest it.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.