arm "neon"
Torbjorn Granlund
tg at gmplib.org
Mon Jan 14 18:03:42 CET 2013
There are a few aspects worth noting for prospective Neon hackers:
There are 32 64-bit registers, available both in "VFPv3-D32" and Neon.
(There are IIRC at least 4 levels of FP support, "VFP", "VFPv2",
"VFPv3-D16", and "VFPv3-D32"... I've seen references to "VFPv4" too.)
In Neon, the registers can also be addressed as 16 128-bit registers.
They are numbered Q0-Q15, so that e.g., Q3 overlaps D6,D7. This is
somewhat unusual naming.
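The overlap described above can be sketched as a small index mapping. This is only an illustration; the helper names q_low_d and q_high_d are made up here and are not part of any toolchain.

```c
#include <assert.h>

/* Hypothetical helpers: which 64-bit D registers alias 128-bit register Qn.
   Qn covers D(2n) (low half) and D(2n+1) (high half), for n = 0..15. */
static int q_low_d(int q)
{
    assert(q >= 0 && q < 16);   /* AArch32 Neon has Q0-Q15 */
    return 2 * q;
}

static int q_high_d(int q)
{
    assert(q >= 0 && q < 16);
    return 2 * q + 1;
}
```

With this mapping, q_low_d(3) and q_high_d(3) give 6 and 7, matching the Q3/D6,D7 example.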
The 64-bit ARM architecture (AArch64) doubles the number of Neon
registers, back to 32, each 128 bits wide.
Note also the multitude of Neon integer addition instructions.
I played with vmlal.u32 on A9 and A15. Surprisingly, both CPUs are very
cooperative in that the accumulation dependency is very shallow.
E.g.,
        .text
        .globl  main
        .type   main, #function
main:   mov     r0, #1694498816         @ loop count
1:      subs    r0, r0, #1              @ decrement, set flags for bne
        vmlal.u32 q1, d0, d0            @ all four accumulate into q1
        vmlal.u32 q1, d0, d0
        vmlal.u32 q1, d0, d0
        vmlal.u32 q1, d0, d0
        bne     1b
        bx      lr
runs just as fast as
        .text
        .globl  main
        .type   main, #function
main:   mov     r0, #1694498816         @ loop count
1:      subs    r0, r0, #1              @ decrement, set flags for bne
        vmlal.u32 q1, d0, d0            @ four independent accumulators
        vmlal.u32 q2, d0, d0
        vmlal.u32 q3, d0, d0
        vmlal.u32 q4, d0, d0
        bne     1b
        bx      lr
even though the first example creates an accumulation dependency.
The examples run at a throughput of 1 insn/cycle on A15 and 1/2
insn/cycle on A9.
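For readers who want to see where the accumulation dependency comes from, here is a plain C model of what one vmlal.u32 computes: each of the two 32-bit lanes of the D operands is multiplied to a 64-bit product and added to the matching 64-bit lane of the Q destination. The types d_reg and q_reg and the function name are invented for this sketch. It models the arithmetic only and says nothing about timing.

```c
#include <stdint.h>

/* Hypothetical models of a 64-bit D register (two 32-bit lanes)
   and a 128-bit Q register (two 64-bit lanes). */
typedef struct { uint32_t lane[2]; } d_reg;
typedef struct { uint64_t lane[2]; } q_reg;

/* Model of "vmlal.u32 qd, dn, dm": widening multiply, then accumulate.
   The destination is both read and written, which is the accumulation
   dependency when consecutive instructions use the same qd. */
static q_reg vmlal_u32_model(q_reg acc, d_reg n, d_reg m)
{
    for (int i = 0; i < 2; i++)
        acc.lane[i] += (uint64_t)n.lane[i] * m.lane[i];
    return acc;
}
```

In the first loop every instruction feeds its result into the next one through q1; in the second loop each accumulator is only reused once per iteration.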
(If one creates a dependency on the multiplicands, things run horribly,
as expected; the throughput drops to 1/5 insn/cycle on A15 and 1/6
insn/cycle on A9.)
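To put the quoted throughputs in terms of cycles per loop iteration, a rough back-of-the-envelope helper follows. It assumes the four vmlal.u32 instructions dominate and that the subs/bne overhead is hidden, which the post does not state explicitly. The function name cycles_for is made up for this sketch.

```c
#include <stdint.h>

/* Cycles needed for `insns` instructions at a sustained throughput of
   num/den insn/cycle (e.g. num=1, den=2 for 1/2 insn/cycle). */
static uint64_t cycles_for(uint64_t insns, uint64_t num, uint64_t den)
{
    return insns * den / num;
}
```

For the four vmlal.u32 instructions in one iteration, this gives 4 cycles on A15 and 8 on A9 without the multiplicand dependency, and 20 and 24 cycles with it.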
--
Torbjörn