arm "neon"

Tue Feb 19 10:34:15 CET 2013

I have a really hard time getting my head around addmul_2.

The easiest (to me) way to think about addmul_2 is an iteration where we
have a double limb carry (c1, c0), and add

      c1 c0
         r0
      u0*v0
   u0*v1
------------
   c1 c0 r0

This can be done as two umaal,

  umaal r0, c0, u0, v0
  umaal c0, c1, u0, v1

Unfortunately, the second umaal consumes the c0 produced by the first,
so the two are dependent and don't map onto parallel simd lanes.
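
To make the iteration concrete, here is a minimal C sketch that models
umaal with 64-bit arithmetic and runs the two-umaal step over n limbs.
The names umaal and addmul_2_sketch, the zero initial carries, and the
convention of writing c0 to rp[n] and returning c1 are assumptions made
for illustration, not the interface of the existing code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ARM "umaal lo, hi, a, b": (hi:lo) = a*b + lo + hi.
   The sum cannot overflow 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical helper: rp[0..n-1] += up[0..n-1] * (v1*2^32 + v0).
   Assumption: the low carry limb is written to rp[n] (its old value
   is ignored) and the high carry limb is returned. */
uint32_t addmul_2_sketch(uint32_t *rp, const uint32_t *up, int n,
                         uint32_t v0, uint32_t v1)
{
    uint32_t c0 = 0, c1 = 0;   /* double limb carry (c1, c0) */
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        uint32_t r0 = rp[i];
        umaal(&r0, &c0, up[i], v0);   /* umaal r0, c0, u0, v0 */
        umaal(&c0, &c1, up[i], v1);   /* umaal c0, c1, u0, v1 (needs new c0) */
        rp[i] = r0;
    }
    rp[n] = c0;
    return c1;
}
```

The second call reads the c0 written by the first, which is exactly the
dependency that gets in the way of doing both in one simd operation.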

I have some difficulty following the current addmul_2 code, which
exhibits some independence. I think it's two addition chains,

         u0*v0      u0*v1
      u1*v0      u1*v1
   u2*v0      u2*v1

with separate chaining variables cya and cyb. And then the second chain
adds in the r limbs as the second add operand to umaal, while the first
chain adds in an appropriate low limb from the second chain and stores
back to r.

Another way to think about independence (probably equivalent to the
current code?) is to add in the next r limb earlier, and use three carry
limbs, two of which should be added in at the same position:

      c1 c0
      r1 c0'
      u0*v0
   u0*v1
------------
   c1 c0 r0
      c0'

If I get this right, that would be the following two umaal:

  umaal c0', c0, u0, v0
  str c0', [rp]
  ldr c0', [rp, #4]
  umaal c0', c1, u0, v1

(This looks the same as the current code, so maybe I got something
right.) With neon, if we put

  d0: v0, v1
  d1: u0, u0
  d2: c0, c1
  d3: c0', r1

we could do

  vmull.u32	q3, d0, d1	; Form products
  vaddl.u32	q4, d2, d3	; Add and extend carry inputs
  vadd.i64	q1, q3, q4
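
As a sanity check of the lane arithmetic, here is a small C model of
these three instructions, with each 64-bit lane of a q register held in
a uint64_t. The struct lanes type and the neon_step function are
hypothetical names used only for this sketch; it does not use real NEON
intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a q register as two 64-bit lanes.
   For q1, lane0 corresponds to d2 and lane1 to d3. */
struct lanes { uint64_t lane0, lane1; };

struct lanes neon_step(uint32_t u0, uint32_t v0, uint32_t v1,
                       uint32_t c0, uint32_t c1,
                       uint32_t c0p, uint32_t r1)
{
    struct lanes q3, q4, q1;

    /* vmull.u32 q3, d0, d1 : 32x32->64 products, d0 = (v0, v1), d1 = (u0, u0) */
    q3.lane0 = (uint64_t)v0 * u0;
    q3.lane1 = (uint64_t)v1 * u0;

    /* vaddl.u32 q4, d2, d3 : widening add, d2 = (c0, c1), d3 = (c0', r1) */
    q4.lane0 = (uint64_t)c0 + c0p;
    q4.lane1 = (uint64_t)c1 + r1;

    /* vadd.i64 q1, q3, q4 : cannot wrap, since
       (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1 */
    q1.lane0 = q3.lane0 + q4.lane0;
    q1.lane1 = q3.lane1 + q4.lane1;
    return q1;
}
```

In this model, the low half of lane0 is the finished limb r0 and its
high half is the next c0, while lane1 splits into the next c0' (low
half) and the next c1 (high half).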

Now, q1 overlaps with d2, d3, so we get

  d2: c0, c1
  d3: r0, c0'

(not sure if I have the order right). So to prepare for the next
iteration, all we need to do is

* Rotate d3, moving c0' to the right position, store r0, and load r2, so
  we get

  d3: r2, c0'

  Needs vst1, vld1, vshl (or maybe something more clever with vext, if
  we unroll and want to load and store two elements at a time).

* Load d1, so it contains two copies of u1, a single vld1, if I
  understand the manual correctly.
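
Putting the per-iteration steps together, the whole loop with three
carry limbs could look like the following C sketch. It models the two
lanes as plain uint64_t values rather than real NEON registers, so it
says nothing about the shuffles needed between iterations. The name
addmul_2_lanes, the zero initial carries, feeding zero in place of the
last r load, and returning the top limb are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: rp[0..n-1] += up[0..n-1] * (v1*2^32 + v0).
   Assumption: the old rp[n] is ignored, the next-to-top result limb
   is written to rp[n], and the top limb is returned. */
uint32_t addmul_2_lanes(uint32_t *rp, const uint32_t *up, int n,
                        uint32_t v0, uint32_t v1)
{
    uint32_t c0 = 0, c1 = 0;   /* d2: c0, c1 */
    uint32_t c0p;              /* c0', same position as c0 */
    uint64_t top;

    assert(n > 0);
    c0p = rp[0];               /* the first r limb enters via c0' */

    for (int i = 0; i < n; i++) {
        /* Load the next r limb early; use 0 past the end. */
        uint32_t rnext = (i + 1 < n) ? rp[i + 1] : 0;

        /* The two lanes are independent of each other. */
        uint64_t lane0 = (uint64_t)up[i] * v0 + c0 + c0p;
        uint64_t lane1 = (uint64_t)up[i] * v1 + c1 + rnext;

        rp[i] = (uint32_t)lane0;          /* store r_i */
        c0    = (uint32_t)(lane0 >> 32);
        c0p   = (uint32_t)lane1;
        c1    = (uint32_t)(lane1 >> 32);
    }

    /* c0 and c0' sit at the same position, so they need one final add. */
    top = (uint64_t)c0 + c0p;
    rp[n] = (uint32_t)top;
    return c1 + (uint32_t)(top >> 32);
}
```

The price of the independence is the final addition of c0 and c0'
after the loop, which the two-carry formulation does not need.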

What do you think? Not sure if the same trick can be used with the simd
features of x86_32 or power (but as far as I'm aware sse in x86_64 lacks
a 64x64->128 multiply, and then it's pretty useless for multiplication).

/Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.