# ARM Neon multiplication

Torbjorn Granlund tg at gmplib.org
Wed Apr 10 12:49:19 CEST 2013

```
Richard Henderson earlier wrote an addmul_8 which runs at an impressive
1.6 c/l (or something thereabout).

Since accumulating in Neon is somewhat tricky, I decided to try
alternatives.  I now have a mul_1 loop which performs multiplies using
neon insns and then adds the result using plain old adcs.  The loop is
very deeply pipelined, surely too much so.  Starting with a deep
pipeline helps with estimating performance headroom.

On A15, the loop runs at about 1.83 c/l.  This should be compared to
2.25 for the repo code.  Not a bad improvement, but perhaps it
translates to a real improvement only if we can shallow the pipeline
while maintaining the 1.83 c/l.

If an addmul_1 can be made to run at 1.83 c/l, that would be more
exciting, even with a deep pipeline.  We have a vmlal, which could help.
That would require loading from the rp[] vector to a 128-bit register to
create a state like {0,rp[i+1],0,rp[i]}.  I am not sure how to do that
efficiently.

Here is the new mul_1 loop:

L(top):	adcs	r12, r5, r6
	vmov	  r4, r5, d0		C 0 1
	str	r12, [rp], #4
	adcs	r12, r7, r8
	vmov	  r6, r7, d1		C 1 2
	str	r12, [rp], #4
	vmull.u32 q0, d16, v
	adcs	r12, r9, r10
	vmov	  r8, r9, d2		C 2 3
	str	r12, [rp], #4
	adcs	r12, r11, r4
	vmov	  r10, r11, d3		C 3 4
	str	r12, [rp], #4
	vmull.u32 q1, d17, v
	vld1.u32  {d16,d17}, [up]!
	adcs	r12, r5, r6
	vmov	  r4, r5, d4		C 4 5
	str	r12, [rp], #4
	adcs	r12, r7, r8
	vmov	  r6, r7, d5		C 5 6
	str	r12, [rp], #4
	vmull.u32 q2, d18, v
	adcs	r12, r9, r10
	vmov	  r8, r9, d6		C 6 7
	str	r12, [rp], #4
	adcs	r12, r4, r11
	vmov	  r10, r11, d7		C 7 8
	str	r12, [rp], #4
	vmull.u32 q3, d19, v
	vld1.u32  {d18,d19}, [up]!
	sub	n, n, #8
	tst	n, n
	bpl	L(top)

--
Torbjörn
```
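
For readers who want to follow the arithmetic without the software pipelining, here is a rough C model of what the loop above computes. It is only a sketch, not the GMP code: the names `ref_mul_1` and `ref_addmul_1` are made up for illustration, and limbs are assumed to be 32 bits. The first function forms each full 64-bit product (as `vmull.u32` does), splits it into halves (as `vmov` does), and adds the high half of one product to the low half of the next through a carry chain (as `adcs` does). The second models the `vmlal` idea from the post: each `rp[]` limb is zero-extended into a 64-bit lane, giving the state `{0,rp[i]}`, and the product is accumulated into it. That accumulation cannot overflow, since (2^32-1)^2 + (2^32-1) < 2^64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of mul_1 with 32-bit limbs:
   rp[0..n-1] = low limbs of up[0..n-1] * v; the return value is the top limb. */
uint32_t ref_mul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n >= 1);
    uint32_t prev_hi = 0;   /* high half of the previous product */
    uint32_t carry = 0;     /* carry flag carried between additions */
    for (size_t i = 0; i < n; i++) {
        uint64_t prod = (uint64_t)up[i] * v;      /* like vmull.u32 */
        uint32_t lo = (uint32_t)prod;             /* like vmov lo, hi, dN */
        uint32_t hi = (uint32_t)(prod >> 32);
        uint64_t sum = (uint64_t)lo + prev_hi + carry;  /* like adcs */
        rp[i] = (uint32_t)sum;                    /* like str */
        carry = (uint32_t)(sum >> 32);
        prev_hi = hi;
    }
    /* The full product fits in n+1 limbs, so this addition cannot wrap. */
    return prev_hi + carry;
}

/* Hypothetical model of an addmul_1 built on a vmlal-style accumulate:
   rp[0..n-1] += up[0..n-1] * v; the return value is the carry-out limb. */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n >= 1);
    uint32_t prev_hi = 0;
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t acc = rp[i];              /* lane state {0, rp[i]} */
        acc += (uint64_t)up[i] * v;        /* like vmlal.u32; cannot overflow */
        uint32_t lo = (uint32_t)acc;
        uint32_t hi = (uint32_t)(acc >> 32);
        uint64_t sum = (uint64_t)lo + prev_hi + carry;
        rp[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
        prev_hi = hi;
    }
    return prev_hi + carry;
}
```

The model says nothing about scheduling or cycle counts. It only pins down which halves meet in each `adcs`, which is what the `C 0 1`, `C 1 2`, ... annotations in the loop appear to track.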