arm "neon"

Torbjorn Granlund tg at gmplib.org
Tue Feb 19 15:10:16 CET 2013


nisse at lysator.liu.se (Niels Möller) writes:

  I have a really hard time getting my head around addmul_2.
  
It can indeed be tricky.

  The easiest (to me) way to think about addmul_2 is an iteration where we
  have a double limb carry (c1, c0), and add
  
        c1 c0
           r0
        u0*v0
     u0*v1
  ------------
     c1 c0 r0
  
  This can be done as two umaal,
  
    umaal r0, c0, u0, v0
    umaal c0, c1, u0, v1
  
  Unfortunately, these have a dependency via c0, and don't fit simd
  operations.
  
A dependency like that is not too bad, as long as not all umaal
instructions are chained together.  The current arm/v6/addmul_2 has such
local dependencies (through r4 and r5).
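
In C terms, a model of umaal and of the double-limb-carry iteration
quoted above might look like this (just a sketch for reference, not GMP
code; variable names follow the quote):

  #include <stdint.h>

  /* Model of ARM umaal: {*hi,*lo} = a*b + *lo + *hi.  The sum cannot
     overflow 64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1.  */
  static inline void
  umaal (uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
  {
    uint64_t t = (uint64_t) a * b + *lo + *hi;
    *lo = (uint32_t) t;
    *hi = (uint32_t) (t >> 32);
  }

  /* One iteration with a double limb carry (c1,c0), as in the quoted
     asm; the second umaal depends on the c0 produced by the first.  */
  static void
  addmul_2_step (uint32_t *r0, uint32_t *c0, uint32_t *c1,
                 uint32_t u0, uint32_t v0, uint32_t v1)
  {
    umaal (r0, c0, u0, v0);   /* finished low limb in r0, carry in c0 */
    umaal (c0, c1, u0, v1);   /* fold u0*v1 into the carry pair (c1,c0) */
  }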

  I have some difficulty following the current addmul_2 code, which
  exhibits some independence. I think it's two addition chains,
  
           u0*v0      u0*v1
        u1*v0      u1*v1
     u2*v0      u2*v1
  
  with separate chaining variables cya and cyb. And then the second chain
  adds in the r limbs as the second add operand to umaal, while the first
  chain adds in an appropriate low limb from the second chain and stores
  back to r.

cya is the high word from v0 multiplies, cyb is the high from v1
multiplies.  An addmul_3 would need a 'cyc' for v2 multiplies...

The reason for this organisation is to create separate 'recurrency
paths', or interleaving of k multiplies for addmul_(k).
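
Concretely, a C model of one pass of that organisation could look like
the sketch below (hypothetical code reusing the umaal model above, not
the actual arm/v6/addmul_2.asm; it adds {up,n} * {v0,v1} to {rp,n} and
returns the two spilled high limbs packed into a uint64_t):

  #include <stdint.h>
  #include <stddef.h>

  uint64_t
  addmul_2_model (uint32_t *rp, const uint32_t *up, size_t n,
                  uint32_t v0, uint32_t v1)
  {
    uint32_t cya = 0, cyb = 0;  /* carries of the v0 resp. v1 chain */
    uint32_t pb = rp[0];        /* pending low limb from the v1 chain */
    size_t i;

    for (i = 0; i < n; i++)
      {
        /* v1 chain: fold u[i]*v1, r[i+1] and cyb into the limb that
           the v0 chain will finish in the next iteration.  */
        uint32_t t = (i + 1 < n) ? rp[i + 1] : 0;
        umaal (&t, &cyb, up[i], v1);

        /* v0 chain: finish limb i by adding u[i]*v0 and cya to the
           pending limb, then store it back to r.  */
        umaal (&pb, &cya, up[i], v0);
        rp[i] = pb;

        pb = t;
      }

    /* Wind down: limb n is pb + cya, limb n+1 is cyb (plus carry).  */
    return (uint64_t) pb + cya + ((uint64_t) cyb << 32);
  }

The point is that the two umaal calls in an iteration use different
carry variables, so the two recurrency paths can overlap on real
hardware instead of forming one long dependency chain.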

  Another way to think about independence (probably equivalent to the
  current code?) is to add in the next r limb earlier, and use three carry
  limbs, two of which should be added in at the same position:
  
  What do you think? Not sure if the same trick can be used with the simd
  features of x86_32 or power (but as far as I'm aware sse in x86_64 lacks
  a 64x64->128 multiply, and then it's pretty useless for multiplication).
  
I am afraid I cannot currently afford the hours it would take to digest
your reasoning here...

I spent many hours evaluating various addition schemes when writing
addmul_(k) loops.  SIMD makes things much harder, of course.  But ARM's
SIMD might be so well thought out that we can actually make use of it.

I think x86 SIMD multiply is still just 32 x 32 -> 64, though perhaps
that will allow for 4-way SIMD in AVX2 (I haven't checked).  Intel is
addressing bignum performance with the mulx and adcx+adox non-SIMD
instructions.  (It is clear how to write an addmul_1 with the two carry
flags of adcx+adox.  It seems harder to write an addmul_2 without extra
instructions for propagating carry.)
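
As a C model of the addmul_1 idea, two carry variables can stand in for
the CF and OF chains of adcx and adox (a sketch of the scheme only; the
real thing would be mulx/adcx/adox assembly, and would likely unroll):

  #include <stdint.h>
  #include <stddef.h>

  /* {rp,n} += {up,n} * v, returning the carry limb.  The multiply
     (like mulx) touches neither carry; cf propagates the high limbs,
     of propagates the additions into rp.  */
  uint64_t
  addmul_1_model (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
  {
    uint64_t hi_prev = 0, cf = 0, of = 0;
    size_t i;

    for (i = 0; i < n; i++)
      {
        unsigned __int128 p = (unsigned __int128) up[i] * v;  /* mulx */
        uint64_t lo = (uint64_t) p;

        /* adcx chain: add the previous high limb.  */
        unsigned __int128 t = (unsigned __int128) lo + hi_prev + cf;
        lo = (uint64_t) t;
        cf = (uint64_t) (t >> 64);

        /* adox chain: add the r limb.  */
        t = (unsigned __int128) lo + rp[i] + of;
        rp[i] = (uint64_t) t;
        of = (uint64_t) (t >> 64);

        hi_prev = (uint64_t) (p >> 64);
      }

    /* Limb n of the result; the exact value fits in one limb.  */
    return hi_prev + cf + of;
  }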

I counter your suggestions with an 8-page note on Itanic addmul_(k)
optimisation:

[Attachment: itanium-gmp-mul_basecase.pdf, application/pdf, 158013 bytes
 <http://gmplib.org/list-archives/gmp-devel/attachments/20130219/a22e2ea3/attachment-0001.pdf>]

:-)

-- 
Torbjörn

