ARM Neon multiplication

Wed Apr 10 12:49:19 CEST 2013

Richard Henderson earlier wrote an addmul_8 which runs at an impressive
1.6 c/l (or thereabouts).

Since accumulating in Neon is somewhat tricky, I decided to try
alternatives.  I now have a mul_1 loop which performs multiplies using
neon insns and then adds the result using plain old adcs.  The loop is
very deeply pipelined, surely too much so.  Starting with a deep
pipeline helps with estimating performance headroom.
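
For reference, the split between multiplying and adding can be modelled
in portable C.  This is only a sketch of the arithmetic, not the loop's
scheduling; the name ref_mul_1 and the 32-bit limb type are assumptions
made for illustration.  Each 64-bit product stands for one vmull.u32
lane, and the add of the previous high half to the current low half
with carry stands for one adcs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not the repo code): computes
   {rp, n} = low n limbs of {up, n} * v and returns the top limb. */
uint32_t ref_mul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t hi_prev = 0;  /* high half of the previous product */
    uint32_t carry = 0;    /* models the carry flag consumed by adcs */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* One 32x32->64 multiply, as a single vmull.u32 lane yields. */
        uint64_t prod = (uint64_t)up[i] * v;
        uint32_t lo = (uint32_t)prod;

        /* adcs: previous high half + current low half + carry in. */
        uint64_t sum = (uint64_t)hi_prev + lo + carry;
        rp[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);

        hi_prev = (uint32_t)(prod >> 32);
    }
    /* Cannot overflow: the full product fits in n+1 limbs. */
    return hi_prev + carry;
}
```

In the model the multiply and the add are adjacent; in the assembly
loop they are separated by several iterations' worth of instructions,
which is what makes the pipeline deep.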

On A15, the loop runs at about 1.83 c/l.  This should be compared to
2.25 for the repo code.  Not a bad improvement, but perhaps it
translates to a real improvement only if we can make the pipeline
shallower while maintaining the 1.83 c/l.

If an addmul_1 can be made to run at 1.83 c/l, that would be more
exciting, even with a deep pipeline.  We have a vmlal, which could help.
That would require loading from the rp[] vector to a 128-bit register to
create a state like {0,rp[i+1],0,rp[i]}.  I am not sure how to do that
efficiently.
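
The register state described above can be sketched in portable C to
show why it would suffice.  The type q_model and both function names
are hypothetical, and the sketch says nothing about how to perform the
load efficiently on real hardware.  Each 64-bit lane holds a
zero-extended limb of rp[], so adding a 32x32 product to it cannot
overflow the lane: (2^32-1)^2 + (2^32-1) < 2^64.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit register seen as two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } q_model;

/* Build {0, rp[1], 0, rp[0]}: each 32-bit limb zero-extended to a lane. */
q_model widen_pair(const uint32_t *rp)
{
    q_model q = { { (uint64_t)rp[0], (uint64_t)rp[1] } };
    return q;
}

/* Model of a widening multiply-accumulate by a scalar:
   lane[k] += up[k] * v, with a full 64-bit product per lane. */
q_model mlal_model(q_model acc, const uint32_t *up, uint32_t v)
{
    acc.lane[0] += (uint64_t)up[0] * v;
    acc.lane[1] += (uint64_t)up[1] * v;
    return acc;
}
```

With all inputs at their maximum, lane 0 ends up as 0xFFFFFFFF00000000,
which still leaves room for one more 32-bit carry limb per lane.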

Here is the new mul_1 loop:

L(top):	adcs	r12, r5, r6
	vmov	  r4, r5, d0		C 0 1
	str	r12, [rp], #4
	adcs	r12, r7, r8
	vmov	  r6, r7, d1		C 1 2
	str	r12, [rp], #4
	vmull.u32 q0, d16, v
	adcs	r12, r9, r10
	vmov	  r8, r9, d2		C 2 3
	str	r12, [rp], #4
	adcs	r12, r11, r4
	vmov	  r10, r11, d3		C 3 4
	str	r12, [rp], #4
	vmull.u32 q1, d17, v
	vld1.u32  {d16,d17}, [up]!
	adcs	r12, r5, r6
	vmov	  r4, r5, d4		C 4 5
	str	r12, [rp], #4
	adcs	r12, r7, r8
	vmov	  r6, r7, d5		C 5 6
	str	r12, [rp], #4
	vmull.u32 q2, d18, v
	adcs	r12, r9, r10
	vmov	  r8, r9, d6		C 6 7
	str	r12, [rp], #4
	adcs	r12, r4, r11
	vmov	  r10, r11, d7		C 7 8
	str	r12, [rp], #4
	vmull.u32 q3, d19, v
	vld1.u32  {d18,d19}, [up]!
	sub	n, n, #8
	tst	n, n
	bpl	L(top)

-- 
Torbjörn