Neon addmul_8

Richard Henderson rth at
Tue Feb 26 19:08:48 CET 2013

On 02/26/2013 05:24 AM, Torbjorn Granlund wrote:
> We should probably work out the latencies for the interesting
> instructions.  That's not hard to do.

Testing for issue latency like this:

	ldr	r0, =1694100000
0:
	vmull.u32	q1, d0, d1
	vmull.u32	q2, d0, d1
	vmull.u32	q3, d0, d1
	subs		r0, #1
	vmull.u32	q4, d0, d1
	bne		0b

Output latency like this, with each multiply consuming the previous result:

	vmull.u32	q1, d0, d1
	vmull.u32	q2, d2, d3
	vmull.u32	q3, d4, d5
	vmull.u32	q0, d6, d7

Then dividing the output of "time" by 4.

In the issue table I'll list pairs of independent insns, to see which might be
dual-issuable.

		issue	output
vmull		1	5
vmlal		1	5
vadd.i64 [qd]	3/4	3
vpaddl		3/4	3
vuzp		7/4	4.5
vext		3/4	3

On the off-chance that there are various producer/consumer bypasses within a
given functional unit, but perhaps not across functional units, I'll list
output latency in a table.

vmull->vuzp		5
vuzp->vpaddl		3
vmlal<->vmlal accum	1
vadd<->vmlal accum	4

Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
not require the addend input until the 4th cycle, producing full output on the
5th.  This seems to be the easiest way to hide a lot of output latency.

I'm not sure quite what's going on with the 3/4 issue rates.  I really would
have expected to see either exactly 1, or very nearly 1/2, especially for vadd.

I did have a browse through gcc's scheduler description for a15 neon, and it
doesn't quite match up with the numbers I see here.  Relevant entries:

	(define_cpu_unit "ca15_cx_ij, ca15_cx_ik" "cortex_a15_neon")

There are two dispatch units for neon insns, J and K.

	(define_cpu_unit "ca15_cx_ialu1, ca15_cx_ialu2" "cortex_a15_neon")

There are two arithmetic pipelines.

	(define_reservation "ca15_cx_imac" "(ca15_cx_ij+ca15_cx_imac1)")

Multiply-accumulate must issue to J; add-accumulate (eg. vpadal) and shifts
must issue to K.

	(define_reservation "ca15_cx_perm" "ca15_cx_ij|ca15_cx_ik")
	(define_reservation "ca15_cx_perm_2" "ca15_cx_ij+ca15_cx_ik")
	  "cortex_a15_neon_bp_simple" 4
	  (and (eq_attr "tune" "cortexa15")
	       (eq_attr "neon_type"

Permute insns (eg. vext) must decompose to 3 micro-ops, because they take both
J and K dispatch units in the first cycle and then either J or K in the second
cycle.

The scheduling description has the look of being auto-generated.  No one would
write names like
cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar by
hand, on purpose.  Does being auto-generated make it more or less accurate?



More information about the gmp-devel mailing list