arm "neon"
Torbjorn Granlund
tg at gmplib.org
Tue Feb 19 15:10:16 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
> I have a really hard time getting my head around addmul_2.
It can indeed be tricky.
> The easiest (to me) way to think about addmul_2 is an iteration where we
> have a double limb carry (c1, c0), and add
>        c1 c0
>           r0
>        u0*v0
>     u0*v1
>     ------------
>     c1 c0 r0
> This can be done as two umaal,
>   umaal r0, c0, u0, v0
>   umaal c0, c1, u0, v1
> Unfortunately, these have a dependency via c0, and don't fit simd
> operations.
A dependency like that is not too bad, as long as all umaal instructions
are not chained together. The current arm/v6/addmul_2 has such local
dependencies (through r4 and r5).
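A minimal C sketch of the single iteration discussed above may help make the dependency through c0 concrete. It assumes 32-bit limbs and models umaal with a 64-bit temporary; the names limb_t, umaal and addmul_2_step are hypothetical and are not taken from the GMP sources.

```c
#include <stdint.h>

typedef uint32_t limb_t;            /* assumption: 32-bit limbs, as on ARMv6 */

/* Model of the ARM umaal instruction: (*hi:*lo) = u*v + *lo + *hi.
   The sum cannot overflow 64 bits, since (B-1)^2 + 2(B-1) = B^2 - 1. */
static void umaal(limb_t *lo, limb_t *hi, limb_t u, limb_t v)
{
    uint64_t t = (uint64_t)u * v + *lo + *hi;
    *lo = (limb_t)t;
    *hi = (limb_t)(t >> 32);
}

/* One addmul_2 iteration (hypothetical helper):
   (c1, c0, r0) = (c1, c0) + r0 + u0*v0 + B*u0*v1.
   Returns the finished low limb r0 and updates the carry pair in place. */
static limb_t addmul_2_step(limb_t r0, limb_t *c0, limb_t *c1,
                            limb_t u0, limb_t v0, limb_t v1)
{
    umaal(&r0, c0, u0, v0);   /* umaal r0, c0, u0, v0 */
    umaal(c0, c1, u0, v1);    /* umaal c0, c1, u0, v1: needs c0 from above */
    return r0;
}
```

The second call reads the c0 written by the first, which is the local dependency in question. Since c1 is only touched by the second call, the chain from one iteration to the next is no longer than these two instructions.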
> I have some difficulty following the current addmul_2 code, which
> exhibits some independence. I think it's two addition chains,
>   u0*v0   u0*v1
>   u1*v0   u1*v1
>   u2*v0   u2*v1
> with separate chaining variables cya and cyb. And then the second chain
> adds in the r limbs as the second add operand to umaal, while the first
> chain adds in an appropriate low limb from the second chain and stores
> back to r.
cya is the high word from v0 multiplies, cyb is the high from v1
multiplies. An addmul_3 would need a 'cyc' for v2 multiplies...
The reason for this organisation is to create separate 'recurrency
paths', or interleaving of k multiplies for addmul_(k).
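The two-chain organisation described above can be modelled in C roughly as follows. This is a sketch under stated assumptions, not the arm/v6 code itself: limbs are 32 bits, umaal is emulated with a 64-bit temporary, and the function name addmul_2_two_chains is hypothetical. The variables cya and cyb follow the names used in the discussion.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;            /* assumption: 32-bit limbs */

/* Model of ARM umaal: (*hi:*lo) = u*v + *lo + *hi, which never overflows. */
static void umaal(limb_t *lo, limb_t *hi, limb_t u, limb_t v)
{
    uint64_t t = (uint64_t)u * v + *lo + *hi;
    *lo = (limb_t)t;
    *hi = (limb_t)(t >> 32);
}

/* Computes {rp,n} + {up,n} * (v0 + B*v1), stores the n+1 low limbs at rp
   and returns the most significant limb. Requires n >= 1 and room for
   n+1 limbs at rp. */
static limb_t addmul_2_two_chains(limb_t *rp, const limb_t *up, size_t n,
                                  limb_t v0, limb_t v1)
{
    limb_t cya = 0, cyb = 0;   /* high limbs of the v0 and v1 chains */
    limb_t lo = rp[0];         /* limb handed from chain b to chain a */

    for (size_t i = 0; i < n; i++) {
        /* Chain a (v0 multiplies): finish position i and store it. */
        umaal(&lo, &cya, up[i], v0);
        rp[i] = lo;

        /* Chain b (v1 multiplies): start position i+1 by absorbing the
           next r limb; its low limb feeds chain a in the next round. */
        lo = (i + 1 < n) ? rp[i + 1] : 0;
        umaal(&lo, &cyb, up[i], v1);
    }

    /* Wind down: position n gets lo + cya, and its carry joins cyb. */
    uint64_t t = (uint64_t)lo + cya;
    rp[n] = (limb_t)t;
    return cyb + (limb_t)(t >> 32);
}
```

Within one round the two umaal operations do not depend on each other; cya and cyb each carry only their own chain from round to round, and the only cross link is the low limb that chain b hands to chain a one round later.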
> Another way to think about independence (probably equivalent to the
> current code?) is to add in the next r limb earlier, and use three carry
> limbs, two of which should be added in at the same position:
> What do you think? Not sure if the same trick can be used with the simd
> features of x86_32 or power (but as far as I'm aware sse in x86_64 lacks
> a 64x64->128 multiply, and then it's pretty useless for multiplication).
I am afraid I cannot currently afford the hours it would take to digest
your reasoning here...
I spend many hours evaluating various addition schemes when writing
addmul_(k) loops. SIMD makes things much harder, of course. But ARM's
SIMD might be so well thought out that we can actually make use of it.
I think x86 SIMD multiply is still just 32 x 32 -> 64, even if perhaps
that will allow for 4-way SIMD in AVX2 (haven't checked). Intel is
addressing bignum performance with the mulx and adcx+adox non-SIMD
instructions. (It is clear how to write an addmul_1 with the two carry
flags of adcx+adox. It seems harder to write an addmul_2 without extra
instructions for propagating carry.)
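As an illustration of the addmul_1 case mentioned above, here is a portable C model of one possible arrangement with two independent carry chains. It is only a sketch: it assumes 32-bit limbs rather than the 64-bit limbs real mulx/adcx/adox code would use, the two flags are modelled as plain variables cf and of, and the function name addmul_1_two_flags is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;            /* assumption: 32-bit limbs for portability */

/* Computes {rp,n} += {up,n} * v and returns the carry-out limb.
   The low product halves are added through one carry chain (the role of
   adcx and CF), the high halves through another (adox and OF). */
static limb_t addmul_1_two_flags(limb_t *rp, const limb_t *up, size_t n,
                                 limb_t v)
{
    unsigned cf = 0, of = 0;   /* the two modelled carry flags, each 0 or 1 */
    limb_t hi_prev = 0;        /* high half of the previous product */

    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v;          /* mulx: leaves flags alone */
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 32);

        uint64_t t = (uint64_t)rp[i] + lo + cf;    /* adcx: CF chain */
        cf = (unsigned)(t >> 32);

        t = (uint64_t)(limb_t)t + hi_prev + of;    /* adox: OF chain */
        of = (unsigned)(t >> 32);

        rp[i] = (limb_t)t;
        hi_prev = hi;
    }
    /* The full result fits in n+1 limbs, so this sum cannot wrap. */
    return hi_prev + cf + of;
}
```

Each chain consumes only its own flag, so neither addition has to wait for the other's carry. With a second multiplier limb, as in addmul_2, there are more product halves per position than there are flags, which fits the remark about needing extra carry-propagation instructions.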
I counter your suggestions with an 8-page note on Itanic addmul_(k)
optimisation:
Attachment: itanium-gmp-mul_basecase.pdf (application/pdf, 158013 bytes)
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130219/a22e2ea3/attachment-0001.pdf>
:-)
--
Torbjörn