rth at twiddle.net
Tue Feb 26 19:08:48 CET 2013
On 02/26/2013 05:24 AM, Torbjorn Granlund wrote:
> We should probably work out the latencies for the interesting
> instructions. That's not hard to do.
Testing for issue latency like this:
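	ldr	r0, =1694100000		@ loop count
1:	vmull.u32 q1, d0, d1		@ loop label assumed; four independent insns
	vmull.u32 q2, d0, d1
	vmull.u32 q3, d0, d1
	subs	r0, #1			@ sets the flags for the loop branch
	vmull.u32 q4, d0, d1
	bne	1b			@ branch back assumed; vmull leaves the flags alone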
Output latency like this:
vmull.u32 q1, d0, d1
vmull.u32 q2, d2, d3
vmull.u32 q3, d4, d5
vmull.u32 q0, d6, d7
Then dividing the output of "time" by 4.
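
A minimal sketch of the arithmetic behind that division, in Python. It rests on
an assumption the text does not state: that the loop count 1694100000 equals
the core clock in Hz, so the seconds reported by "time" read directly as cycles
per loop iteration. The function name and its parameters are hypothetical, and
the loop overhead (subs and the branch) is ignored.

```python
def cycles_per_insn(elapsed_seconds, iterations=1_694_100_000,
                    insns_per_iteration=4, clock_hz=1_694_100_000):
    """Convert a wall-clock loop time into average cycles per instruction.

    clock_hz is an assumption: it is taken to equal the loop count, so
    one second of run time corresponds to one cycle per iteration.
    """
    total_cycles = elapsed_seconds * clock_hz
    return total_cycles / (iterations * insns_per_iteration)


# Hypothetical timing, not a measurement from this post:
# a 20 second run of a loop holding 4 insns.
print(cycles_per_insn(20.0))  # 5.0
```

With the default arguments the clock and the loop count cancel, which is why
the result is just the elapsed time divided by 4.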
In the issue table I'll list pairs of independent insns, seeing which might be
insn            issue (cycles)  output latency (cycles)
vmull           1               5
vmlal           1               5
vadd.i64 [qd]   3/4             3
vpaddl          3/4             3
vuzp            7/4             4.5
vext            3/4             3
On the off-chance that there are various producer/consumer bypasses within a
given functional unit, but perhaps not across functional units, I'll list
output latency in a table.
producer<->consumer     latency (cycles)
vmlal<->vmlal accum     1
vadd<->vmlal accum      4
Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
not require the addend input until the 4th cycle, producing full output on the
5th. This seems to be the easiest way to hide a lot of output latency.
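
To put rough numbers on that, here is a toy model in Python. It is a sketch
under stated assumptions, not a description of the real pipeline: insns issue
in order, each dependent insn issues a fixed number of cycles after its
producer, and the last result appears 5 cycles after its issue. The 5-cycle
full latency and the 1-cycle vmlal-to-vmlal accumulator latency are the
measured figures from the tables; the function name is hypothetical.

```python
def chain_cycles(n, dep_latency, final_latency=5):
    """Cycles until the last of n back-to-back dependent insns completes.

    Toy in-order model: each insn issues dep_latency cycles after the
    insn it depends on, and the last result is available final_latency
    cycles after that last issue.
    """
    if n < 1:
        raise ValueError("need at least one insn in the chain")
    last_issue = (n - 1) * dep_latency
    return last_issue + final_latency


# Eight vmlal insns accumulating into the same register.
print(chain_cycles(8, dep_latency=1))  # 12, with the accumulator bypass
print(chain_cycles(8, dep_latency=5))  # 40, if each had to wait for full output
```

Under this model a chain of accumulating vmlal insns pays the 5-cycle latency
only once, at the end of the chain.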
I'm not sure quite what's going on with the 3/4 issue rates. I really would
have expected to see either exactly 1, or very nearly 1/2, especially for vadd.
I did have a browse through gcc's scheduler description for a15 neon, and it
doesn't quite match up with the numbers I see here. Relevant entries:
(define_cpu_unit "ca15_cx_ij, ca15_cx_ik" "cortex_a15_neon")
There are two dispatch units for neon insns, J and K.
(define_cpu_unit "ca15_cx_ialu1, ca15_cx_ialu2" "cortex_a15_neon")
There are two arithmetic pipelines.
(define_reservation "ca15_cx_imac" "(ca15_cx_ij+ca15_cx_imac1)")
Multiply-accumulate must issue to J,
Add-accumulate (eg. vpadal) and shifts must issue to K.
(define_reservation "ca15_cx_perm" "ca15_cx_ij|ca15_cx_ik")
(define_reservation "ca15_cx_perm_2" "ca15_cx_ij+ca15_cx_ik")
(and (eq_attr "tune" "cortexa15")
Permute insns (eg. vext) must decompose to 3 micro-ops, because they take both
J and K dispatch units in the first cycle and then either J or K in the second.
The scheduling description has the look of being auto-generated. No-one would
write names like these by hand, on purpose. Does auto-generating mean it's more
or less accurate?