arm "neon"
Torbjorn Granlund
tg at gmplib.org
Mon Jan 14 17:16:42 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
The idea was that u0, u1 is the loop-invariant operand, and the above is
for one iteration processing only a single limb from v.
Ehum. Perhaps we should change to that convention, but until we've done
that, sticking to the current will improve my understanding...
A sum of 32-bit values can be accumulated into a 64-bit register. But if
we want to accumulate 64-bit values, i.e., limb products, it gets
tricky.
It cannot be done, except with lots of contortions.
One can add 32-bit things to a 64-bit product without problems, at least
one may add two such things, since ((2^32-1)^2 + (2^32-1) + (2^32-1)) =
B^2 - 1 just fits a two-word accumulator.
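
To make the bound concrete, here is a small C sketch (not from GMP; the
names mul_add2 and check_worst_case are made up for illustration). It
assumes 32-bit limbs, B = 2^32, and checks that a full product plus two
extra 32-bit words peaks at exactly B^2 - 1:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32x32 -> 64 multiply that also adds two 32-bit words.
   The name mul_add2 is hypothetical, chosen for this sketch. */
static uint64_t mul_add2(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    /* Widen before multiplying so the product is computed in 64 bits. */
    return (uint64_t)a * b + c + d;
}

/* Worst case: all four inputs at 2^32 - 1 give exactly B^2 - 1,
   so the sum never wraps around a two-word accumulator. */
static void check_worst_case(void)
{
    const uint32_t m = UINT32_MAX;
    assert(mul_add2(m, m, m, m) == UINT64_MAX);
}
```

A third 32-bit addend would break this: the worst case would then exceed
B^2 - 1 and need a carry out of the two-word accumulator.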
> having a non-zero operand in the high part wouldn't work unless we use
> nails, since else it would overflow.
Maybe it's a poor way to think about addmul_2 to collect the two
products involving a single v limb. I'm not really familiar with how
current assembly loops are organized (if I ever looked into it, I'm
afraid I've forgotten...).
There are lots of variations...
> Neat with just umaal and ld/st...
Definitely neat. I had a quick look, but I'll need a bit more time to
digest it.
Note that there are two *parallel* recurrence paths, one over cya
and one over cyb. Pairwise adjacent umaal instructions have a dependency,
but that's of the benign, non-recurrent type.
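
The two carry chains can be sketched in portable C roughly as follows.
This is only a model under assumptions, not the actual assembly loop:
umaal_model emulates the ARM umaal instruction (RdHi:RdLo = Rn*Rm + RdLo
+ RdHi), addmul_2_model is a hypothetical name, limbs are 32 bits, n >= 1,
and rp[n] is written but not read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates umaal: {*hi,*lo} = a*b + *lo + *hi; cannot overflow 64 bits. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* {rp,n} += {up,n} * (v0 + v1*B); stores limb n in rp[n] and returns
   limb n+1.  Hypothetical model, not GMP's mpn_addmul_2 itself. */
static uint32_t addmul_2_model(uint32_t *rp, const uint32_t *up, size_t n,
                               uint32_t v0, uint32_t v1)
{
    uint32_t cya = 0, cyb = 0;   /* the two independent recurrences */
    uint32_t x;

    assert(n >= 1);
    x = rp[0];
    umaal_model(&x, &cya, up[0], v0);
    rp[0] = x;
    for (size_t i = 1; i < n; i++) {
        x = rp[i];
        umaal_model(&x, &cyb, up[i - 1], v1);  /* chain over cyb */
        umaal_model(&x, &cya, up[i], v0);      /* chain over cya */
        rp[i] = x;  /* x links the adjacent pair, but is not loop-carried */
    }
    x = cya;                                   /* fold chain A into limb n */
    umaal_model(&x, &cyb, up[n - 1], v1);
    rp[n] = x;
    return cyb;
}
```

Only cya and cyb are carried from one iteration to the next; x is reloaded
from rp[] each time, which is what makes the dependency between the paired
umaal operations non-recurrent.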
--
Torbjörn