arm "neon"

Mon Jan 14 18:03:42 CET 2013

There are a few aspects worth noting for prospective Neon hackers:

There are 32 64-bit registers, available in both "VFPv3-D32" and Neon.
(There are IIRC at least 4 levels of FP support, "VFP", "VFPv2",
"VFPv3-D16", and "VFPv3-D32"...  I've seen references to "VFPv4" too.)

In Neon, the registers can also be addressed as 16 128-bit registers.
They are numbered Q0-Q15, so that, e.g., Q3 overlaps D6 and D7.  This is
a somewhat unusual naming scheme.
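
The overlap can be pictured with a small C model.  This is only a
sketch for illustration: the type neon_regfile and the helpers q_low_d
and q_high_d are made-up names, not part of any ARM header, and the
model assumes the usual mapping where Qn covers D(2n) as its low half
and D(2n+1) as its high half.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the Neon register file: the same 256 bytes
   viewed either as 32 D registers or as 16 Q registers.  */
typedef union {
    uint64_t d[32];      /* D0-D31, 64 bits each */
    uint64_t q[16][2];   /* Q0-Q15, each as a low and a high 64-bit half */
} neon_regfile;

/* Number of the D register that holds the low half of Qn.  */
static inline unsigned q_low_d(unsigned n)
{
    assert(n < 16);
    return 2 * n;
}

/* Number of the D register that holds the high half of Qn.  */
static inline unsigned q_high_d(unsigned n)
{
    assert(n < 16);
    return 2 * n + 1;
}
```

In this model, a write to q[3] is visible through d[6] and d[7], which
is the Q3/D6,D7 overlap described above.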

The 64-bit ARM architecture (AArch64) doubles the number of 128-bit
Neon registers, bringing the count back to 32.

Note also the multitude of Neon integer addition instructions.

I played with vmlal.u32 on A9 and A15.  Surprisingly, both CPUs are very
cooperative in that the latency of the accumulation dependency is very
short.

E.g.,

                .text
                .globl  main
                .type   main, #function
        main:   mov     r0, #1694498816         @ loop count, 0x65000000
        1:      subs    r0, r0, #1
                vmlal.u32       q1, d0, d0      @ all four accumulate into q1
                vmlal.u32       q1, d0, d0
                vmlal.u32       q1, d0, d0
                vmlal.u32       q1, d0, d0
                bne     1b
                bx      lr

runs just as fast as

                .text
                .globl  main
                .type   main, #function
        main:   mov     r0, #1694498816         @ loop count, 0x65000000
        1:      subs    r0, r0, #1
                vmlal.u32       q1, d0, d0      @ four independent accumulators
                vmlal.u32       q2, d0, d0
                vmlal.u32       q3, d0, d0
                vmlal.u32       q4, d0, d0
                bne     1b
                bx      lr

even though the first example creates a dependency chain through the
accumulator q1.
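
To make the dependency concrete, here is a plain C model of what
vmlal.u32 computes: each of the two 32-bit lanes of the D operands is
multiplied to a 64-bit product and added to the matching 64-bit lane of
the Q accumulator.  The names d_u32, q_u64, vmlal_u32_model and
chained_iteration are invented for this sketch; it models only the
arithmetic, not the timing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lane[2]; } d_u32;   /* a D register as 2 x 32 bits */
typedef struct { uint64_t lane[2]; } q_u64;   /* a Q register as 2 x 64 bits */

/* Model of "vmlal.u32 qd, dn, dm": widening multiply, then accumulate,
   independently for each lane.  Sums wrap modulo 2^64.  */
static inline q_u64 vmlal_u32_model(q_u64 acc, d_u32 n, d_u32 m)
{
    for (int i = 0; i < 2; i++)
        acc.lane[i] += (uint64_t)n.lane[i] * m.lane[i];
    return acc;
}

/* One iteration of the first loop: each of the four steps takes the
   accumulator produced by the previous step as its input.  */
static inline q_u64 chained_iteration(q_u64 q1, d_u32 d0)
{
    for (int k = 0; k < 4; k++)
        q1 = vmlal_u32_model(q1, d0, d0);
    return q1;
}
```

In the second loop the four calls would each use their own accumulator,
so no step would need the result of another.  The multiplications are
independent in both loops; only the final addition is chained in the
first one.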

The examples run at a throughput of 1 insn/cycle on A15 and 1/2
insn/cycle on A9.

(If one creates a dependency on the multiplicands, things run horribly,
as expected; the throughput drops to 1/5 insn/cycle on A15 and 1/6
insn/cycle on A9.)
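
For anyone repeating the measurement, the arithmetic behind such
figures can be sketched in C as follows.  How the numbers above were
obtained is not stated here, so this is an assumption: the sketch takes
a measured run time and a known, fixed clock frequency as inputs and
counts only the vmlal instructions, ignoring the subs and bne.  The
function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { VMLAL_PER_ITERATION = 4 };   /* four vmlal.u32 in each loop body */

/* Total number of vmlal instructions executed by the test loop.  */
static inline uint64_t vmlal_executed(uint64_t iterations)
{
    return iterations * VMLAL_PER_ITERATION;
}

/* Throughput in insn/cycle, given a measured run time in seconds and
   the core clock in Hz (both are inputs the reader has to supply).  */
static inline double insns_per_cycle(uint64_t insns, double seconds,
                                     double clock_hz)
{
    assert(seconds > 0.0 && clock_hz > 0.0);
    return (double)insns / (seconds * clock_hz);
}
```

With the loop count 1694498816 used above, the loop executes
6777995264 vmlal instructions, so at 1 insn/cycle it takes about 6.8
billion cycles, and at 1/2 insn/cycle about twice that.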

-- 
Torbjörn