ARM Neon multiplication (GNU and improved!)

Torbjorn Granlund tg at
Sun Apr 21 23:05:46 CEST 2013

I've been busy improving addmul_1 and submul_1 for Cortex-A15 lately.

It turned out to be possible to reach 2 c/l for addmul_1 using plain
(non-SIMD) operations; such code has been in the repo for a few days.
The trick was to move the recurrence path away from the
multiply-accumulate instructions, keeping just adcs (add-with-carry) on
that path, and to schedule the code manually for latency.

Ten days ago I posted about a 1.83 c/l mul_1 using Neon (i.e. SIMD)
instructions.  By moving from str to strd (thanks Richard Henderson for
the hint!) the code now runs at 1.48 c/l.  This is still not optimal, I
expect 1.25 c/l to be possible.

This code performs just the multiplies on the SIMD side, and then
laboriously copies things to the core side, where the additions are
done using adcs.

While mul_1 is important, it is nowhere near as important as addmul_1.
Can the new non-SIMD 2 c/l be beaten using SIMD ops?

It turns out that it can.  I've reached 1.65 c/l now with a loop similar
to the one used for mul_1, adjoining one vld1.32 and two vaddw.u32 for
each 4 limbs.

This is by far the best addmul_1 performance we have seen on any CPU.

The code is attached.  Note that it works only for n = 0 (mod 4).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm-neon-skel-mul_1-v2.asm
Type: application/octet-stream
Size: 4011 bytes
Desc: not available
URL: <>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm-neon-skel-addmul_1-v2.asm
Type: application/octet-stream
Size: 5996 bytes
Desc: not available
URL: <>
-------------- next part --------------

I expect there to still be much performance headroom for multiplication
on A15.  This CPU can do 2.5 32x32->64 multiplies per cycle, by running
SIMD and non-SIMD multiplies in parallel.  The A9 can do 1.5 multiplies
per cycle.  Our latest and greatest code has very poor multiply
utilisation, doing 0.61 and 0.48 multiply operations per cycle, for A15
and A9 respectively.  (A15 using the new SIMD addmul_1, A9 using an
older non-SIMD addmul_3.)

I have not looked into a mixed SIMD + non-SIMD addmul_k yet, but I
actually wouldn't be at all surprised if that turns out to be the
fastest approach.

The next step might be to look at a SIMD addmul_2.  If I am not much
mistaken, adjoining one vmull.u32 and two vaddw.u32 per two limbs to the
addmul_1 loop could do it, but there are many possibilities.  That
would almost certainly slow down the loop by at most one cycle per
limb, and since addmul_2 does twice the work per limb, the result
would be considerably quicker than the 1.65 c/l addmul_1...

(An interesting conclusion from these experiments is that A57 addmul_1
should stay away from its fancy new 64-bit multiply instructions.  If my
information is correct, the umul-hi has a throughput of 1/4 per cycle,
while the mul-lo has a throughput of 1/3 per cycle, and furthermore
these instructions conflict with each other.  Therefore, the best we
can hope for with any addmul_k using these instructions is 7 cycles per
64-bit limb.  That's slower than our new 32-bit addmul_1, which
corresponds to 1.65*4 = 6.6 cycles per 64-bit limb.)


More information about the gmp-devel mailing list