arm "neon"

Torbjorn Granlund tg at
Thu Feb 21 11:20:20 CET 2013

nisse at (Niels Möller) writes:

  Found the vmlal instruction now. Makes for a cute loop,
          vld1.32         l01[1], [vp]!
          vld1.32         {u00[]}, [up]!
          vaddl.u32       q1, l01, c01
          vmlal.u32       q1, u00, v01  C q1 overlaps with c01 and l01
          subs            n, #1
          vst1.32         l01[0], [rp]!
          vshr.u64        l01, l01, #32
          bne             .Loop
  but still very slow, 18 cycles / iteration, or 9 c/l.
I suspect that some scheduling just might improve performance by a large
margin.

It looks strange that you load from vp and store to rp.  The m4 aliasing
of SIMD registers might not be beneficial for reading the code...

How do you test your loops?  I don't think tests/devel/try.c handles
addmul_2.  I recommend the old tests/devel/addmul_N.c, which is not
superelegant but will find subtle carry bugs quickly.  Pass -DCLOCK=[Hz]
and -DN=2, and for faster non-timing tests also -DTIMES=1 (yes, odd name;
it means how many times to run the tested functions).

  > One might need to use a bigger building block, say addmul_4, in order to
  > deal with accumulation latency.
  Maybe. That's going to be a larger project. How do you usually organize
  addmul_4? Do you have an iteration that multiplies all four v limbs by
  the same ulimb, or something more fancy?
I am not sure addmul_4 is harder, since there will be more symmetries.
The code might actually be smaller since you can avoid deep sw
pipelining, and should only need 2-way unrolling at most.  (2-way
unrolling helps for load latency scheduling, which will still be
needed.)

How to organise addmul_4 depends so much on available insns that it is
hard to give detailed advice.

Clearly, you want to break dependencies not just between mul
instructions but also between add instructions.

I made some new experiments using this code and variants of it, filename
m.s:

	.align	2
	.global	main
	.type	main, %function
main:
	mov	r0, #0x3b800000
1:	subs		r0, r0, #1
	vmlal.u32	q0, d31, d31
	vaddl.u32	q8, d31, d31
	vmlal.u32	q1, d31, d31
	vaddl.u32	q9, d31, d31
	vmlal.u32	q2, d31, d31
	vaddl.u32	q10, d31, d31
	vmlal.u32	q3, d31, d31
	vaddl.u32	q11, d31, d31
	bne		1b
	bx	lr
	.size	main, .-main

t=`rsync m.s parma: && ssh parma "gcc -marm -march=armv7-a -mfpu=neon m.s && time -p ./a.out" 2>&1 | grep user | awk '{print $2}'`; gexpr $t*1.7/8

The *1.7 is to compensate for the about 1e9 iterations and 1.7e9 clock
of parma.  The /8 is since the loop is 4-way unrolled with 2-way SIMD.

Without the vaddl.u32, the mul throughput is 1 per vmlal or 2 per 32 x
32 -> 64 multiply on A15.  A9 gets exactly half of that (first replacing
the *1.7 by *1.0, since my A9 has a 1 GHz clock).

With the vaddl.u32 (as in the example above) performance drops somewhat,
but not too badly.

*If* loads and stores don't disturb the pipeline, and once latencies are
properly scheduled, we should be able to get close to these c/l numbers:

A9	1.5
A15	0.7

These numbers are absolutely awesome, of course.

The assumption that loads and stores don't disturb the pipeline might
not be too silly, since addmul_4 actually performs very few memops
per computation.  And addmul_5, 6, 7, and 8 are even better.  :-)


More information about the gmp-devel mailing list