arm "neon"

Mon Jan 14 17:16:42 CET 2013

nisse at lysator.liu.se (Niels Möller) writes:

  The idea was that u0, u1 is the loop-invariant operand, and the above is
  for one iteration processing only a single limb from v.

Ehum.  Perhaps we should change to that convention, but until we've done
that, sticking to the current one will improve my understanding...

  A sum of 32-bit values can be accumulated into 64-bit register. But if
  we want to accumulate 64-bit values, i.e., limb products, it gets
  tricky.

It cannot be done, except with lots of contortions.

One can add 32-bit values to a 64-bit product without problems, or at
least one may add two such values, since with B = 2^32 we have
(2^32-1)^2 + (2^32-1) + (2^32-1) = B^2 - 1, which just fits a two-word
accumulator.
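
To make the bound concrete, here is a small C model of that operation (a
32x32 product plus two 32-bit addends, which is what umaal computes).
The names umaal_model and umaal_worst_case are made up for this
illustration; it is only a sketch of the arithmetic, not code from any
existing loop.

```c
#include <stdint.h>

/* Illustrative model: {*hi, *lo} = a * b + *lo + *hi.
   The sum cannot overflow 64 bits, since
   (B-1)^2 + (B-1) + (B-1) = B^2 - 1 for B = 2^32. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Feed the largest possible inputs and return the two-word result. */
static uint64_t umaal_worst_case(void)
{
    uint32_t lo = UINT32_MAX, hi = UINT32_MAX;
    umaal_model(&lo, &hi, UINT32_MAX, UINT32_MAX);
    return ((uint64_t)hi << 32) | lo;
}
```

With all four inputs at 2^32-1 the result is exactly B^2 - 1, so a third
32-bit addend would be one too many.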

  > having a non-zero operand in the high part wouldn't work unless we use
  > nails, since else it would overflow.

  Maybe it's a poor way to think about addmul_2 to collect the two
  products involving a single v limb. I'm not really familiar with how
  current assembly loops are organized (if I ever looked into it, I'm
  afraid I've forgotten...).

There are lots of variations...

  > Neat with just umaal and ld/st...

  Definitely neat. I had a quick look, but I'll need a bit more time to
  digest it.

Note that there are two *parallel* recurrence paths, one over cya and
one over cyb.  Pairwise adjacent umaal instructions have a dependency,
but that's of the benign, non-recurrent type.
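
A rough C sketch of that structure, assuming the current convention
(v0, v1 loop-invariant, one u limb per iteration) and assuming rp[n] is
written but not read.  The function name addmul_2_sketch and the helper
umaal_model are invented for this illustration; this is not the actual
assembly loop, just the dependency pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of umaal: {*hi, *lo} = a * b + *lo + *hi. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Sketch: {rp, n} += {up, n} * (v0 + v1 * 2^32).
   Stores n+1 limbs at rp and returns the most significant limb. */
static uint32_t addmul_2_sketch(uint32_t *rp, const uint32_t *up, size_t n,
                                uint32_t v0, uint32_t v1)
{
    uint32_t cya = 0;    /* carry chain for the v0 products */
    uint32_t cyb = 0;    /* carry chain for the v1 products */
    uint32_t uprev = 0;  /* u limb from the previous iteration */

    for (size_t i = 0; i < n; i++) {
        uint32_t r = rp[i];
        /* The two umaal share r, but that dependency does not carry
           over to the next iteration; only cya and cyb do. */
        umaal_model(&r, &cyb, uprev, v1);
        umaal_model(&r, &cya, up[i], v0);
        rp[i] = r;
        uprev = up[i];
    }

    /* Wind down: the last v1 product absorbs both carries. */
    umaal_model(&cya, &cyb, uprev, v1);
    rp[n] = cya;
    return cyb;
}
```

Each chain is one umaal per iteration deep, so the two can in principle
overlap if the hardware allows it.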

-- 
Torbjörn