arm "neon"
Niels Möller
nisse at lysator.liu.se
Mon Jan 14 17:04:51 CET 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> nisse at lysator.liu.se (Niels Möller) writes:
>
> One can do two parallel umlaal using
>
> umlaal? You probably mean umul, umaal, or umlal...
I meant umaal. I'm getting confused by all these random-looking
instruction names...
> vmull.u32, computing u0 * v0 and u1 * v0
> vaddl.u32, lo (chain variable) + r0 (result area)
> vadd.u64, add above to u0*v0
> vaddl.u32, hi (chain variable) + high half of above sum
> vadd.u64, add above sum to u1 * v0
>
> Looks like 6 cycles (which is poor, right?), excluding any data
> movement. And recurrency latency of four adds, which shouldn't be too
> bad, I imagine.
>
> I didn't read carefully, and I miss v1 multiplies.
The idea was that u0, u1 is the loop-invariant operand, and the above is
for one iteration processing only a single limb from v.
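
A rough scalar sketch of that single iteration, with one C statement per
Neon operation in the list above. The function and struct names are made
up for illustration, and the two vector lanes are written as separate
scalars:

```c
#include <stdint.h>

/* Hypothetical scalar model of one iteration (names are illustrative).
   lo, hi are the 32-bit chain variables, r0 is the old result limb. */
typedef struct { uint32_t r0, lo, hi; } step_result;

static step_result step_model(uint32_t u0, uint32_t u1, uint32_t v0,
                              uint32_t r0, uint32_t lo, uint32_t hi)
{
    uint64_t p0 = (uint64_t)u0 * v0;   /* vmull.u32, lane 0: u0 * v0 */
    uint64_t p1 = (uint64_t)u1 * v0;   /* vmull.u32, lane 1: u1 * v0 */
    uint64_t a  = (uint64_t)lo + r0;   /* vaddl.u32: 33-bit sum */
    uint64_t t  = a + p0;              /* vadd.u64: at most 2^64 - 1 */
    uint64_t b  = (uint64_t)hi + (uint32_t)(t >> 32);  /* vaddl.u32 */
    uint64_t s  = b + p1;              /* vadd.u64: at most 2^64 - 1 */

    /* Low half of t is the result limb; s holds the next chain values. */
    step_result out = { (uint32_t)t, (uint32_t)s, (uint32_t)(s >> 32) };
    return out;
}
```

Neither 64-bit add can wrap, since a 32x32 product plus two 32-bit values
is at most 2^64 - 1.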
> The parallelism of addmul_(k) for k >= 2 should allow for shallow
> recurrency. One may handle the v1 products many cycles before the v0
> products. For Neon, we should surely accumulate into 64-bit "d" types,
> accumulating carry in bits 32, 33, etc.
A sum of 32-bit values can be accumulated into a 64-bit register. But if
we want to accumulate 64-bit values, i.e., limb products, it gets
tricky.
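
To make the headroom concrete, here is a small C check (helper names are
hypothetical). One limb product plus two 32-bit values exactly fills a
64-bit lane, while the sum of two limb products can already wrap:

```c
#include <stdint.h>

/* Largest value of a*b + c + d over 32-bit a, b, c, d.
   (2^32 - 1)^2 + 2*(2^32 - 1) = 2^64 - 1, so nothing wraps here. */
static uint64_t max_product_plus_two(void)
{
    const uint64_t m = UINT32_MAX;
    return m * m + m + m;
}

/* Returns 1 if x + y wraps around modulo 2^64, otherwise 0. */
static int add_wraps(uint64_t x, uint64_t y)
{
    return (uint64_t)(x + y) < x;
}

/* Returns 1 if adding two maximal limb products wraps a 64-bit lane. */
static int two_products_wrap(void)
{
    const uint64_t m = UINT32_MAX;
    return add_wraps(m * m, m * m);
}
```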
> having a non-zero operand in the high part wouldn't work unless we use
> nails, since else it would overflow.
Agreed. I was suggesting something like
     u1 * v0  |   u0 * v0
   +      c1  | +      c0
   ----------------------
     t3   t2      t1   t0
This can be done in parallel with a single vmlal. But next we need to
add those two 64-bit values with a 32-bit overlap,
       t1 t0
  + t3 t2
  ----------
    c1 c0 r0
where r0 is the result word (the old value also needs to be added in, if
it's addmul_2 rather than just mul_2), and c0, c1 are the recurrence
variables for the next iteration.
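
As a sanity check of the two diagrams, a scalar C model of the whole loop
might look as follows. This is only a sketch: the function name and
interface are made up (it is not the mpn_addmul_2 calling convention),
limbs are assumed to be 32 bits, and the old r limb is folded into the low
lane, which on Neon would cost an extra add before or after the vmlal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: {r, n} += (u1 * 2^32 + u0) * {v, n}.
   The two carry limbs left over at the end are stored in c[0], c[1]. */
static void addmul_2_model(uint32_t *r, const uint32_t *v, size_t n,
                           uint32_t u0, uint32_t u1, uint32_t c[2])
{
    uint32_t c0 = 0, c1 = 0;

    for (size_t i = 0; i < n; i++) {
        /* The two lanes of the multiply-accumulate. */
        uint64_t t10 = (uint64_t)u0 * v[i] + c0 + r[i];  /* t1, t0 */
        uint64_t t32 = (uint64_t)u1 * v[i] + c1;         /* t3, t2 */

        /* Add with 32 bits of overlap: t1 lines up with t2. */
        uint64_t s = t32 + (t10 >> 32);

        r[i] = (uint32_t)t10;        /* r0 */
        c0   = (uint32_t)s;          /* c0 */
        c1   = (uint32_t)(s >> 32);  /* c1 */
    }
    c[0] = c0;
    c[1] = c1;
}
```

The overlap add cannot wrap either: t32 is at most 2^64 - 2^32, and the
high half of t10 is at most 2^32 - 1.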
Collecting the two products that involve a single v limb may be a poor
way to think about addmul_2. I'm not really familiar with how the
current assembly loops are organized (if I ever looked into it, I'm
afraid I've forgotten...).
> You might want to take a look at the repo mpn/arm/v6/addmul_2.asm code.
> It avoids the long recurrency chain.
[...]
> Neat with just umaal and ld/st...
Definitely neat. I had a quick look, but I'll need a bit more time to
digest it.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.