arm "neon"
Torbjorn Granlund
tg at gmplib.org
Thu Feb 21 11:20:20 CET 2013
nisse at lysator.liu.se (Niels Möller) writes:
> Found the vmlal instruction now. Makes for a cute loop,
>
> .Loop:
> 	vld1.32		{l01[1]}, [vp]!
> 	vld1.32		{u00[]}, [up]!
> 	vaddl.u32	q1, l01, c01
> 	vmlal.u32	q1, u00, v01	C q1 overlaps with c01 and l01
> 	subs		n, #1
> 	vst1.32		{l01[0]}, [rp]!
> 	vshr.u64	l01, l01, #32
> 	bne		.Loop
>
> but still very slow, 18 cycles / iteration, or 9 c/l.
>
> I suspect that some scheduling just might improve performance by a large
> factor...
It looks strange that you load from vp and store to rp. And the m4
aliasing of the SIMD registers doesn't exactly make the code easier to
read...
How do you test your loops? I don't think tests/devel/try.c handles
addmul_2. I recommend the old tests/devel/addmul_N.c, which is not
superelegant but will find subtle carry bugs quickly. Pass -DCLOCK=[Hz]
and -DN=2, and for faster non-timing tests also -DTIMES=1 (yes, an odd
name; it means how many times to run the tested functions).
> One might need to use a bigger building block, say addmul_4, in order to
> deal with accumulation latency.
Maybe. That's going to be a larger project. How do you usually organize
addmul_4? Do you have an iteration that multiplies all four v limbs by
the same u limb, or something more fancy?
I am not sure addmul_4 is harder, since there will be more symmetries.
The code might actually be smaller, since you can avoid deep software
pipelining, and should need only 2-way unrolling at most. (2-way
unrolling helps for load latency scheduling, which will still be
necessary.)
How to organise addmul_4 depends so much on available insns that it is
hard to give detailed advice.
Clearly, you want to break dependencies between mul instructions but
also on add instructions.
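As a concrete (if naive) answer to the organisation question above, the
variant where each iteration multiplies all four v limbs by the same u
limb looks roughly like this in plain C, again assuming 32-bit limbs. The
carry chain is fully serialised here for clarity; the whole point of the
assembly version is to break exactly these dependencies across registers:

```c
#include <stdint.h>

typedef uint32_t limb;

/* {rp,n+3} += {up,n} * {vp,4}; returns the limb at position n+3.
   One u limb per iteration, multiplied by all four v limbs. */
static limb sketch_addmul_4(limb *rp, const limb *up, long n, const limb *vp)
{
    limb c0 = 0, c1 = 0, c2 = 0, c3 = 0;  /* pending high parts */
    for (long i = 0; i < n; i++) {
        uint64_t u = up[i], t, k;
        /* four independent 32x32->64 products per u limb */
        t = u * vp[0] + rp[i] + c0; rp[i] = (limb)t; k = t >> 32;
        t = u * vp[1] + c1 + k;     c0 = (limb)t;    k = t >> 32;
        t = u * vp[2] + c2 + k;     c1 = (limb)t;    k = t >> 32;
        t = u * vp[3] + c3 + k;     c2 = (limb)t;    c3 = (limb)(t >> 32);
    }
    /* flush the four pending limbs into rp[n..n+2] and the return limb */
    uint64_t t, k;
    t = (uint64_t)rp[n] + c0;         rp[n] = (limb)t;     k = t >> 32;
    t = (uint64_t)rp[n + 1] + c1 + k; rp[n + 1] = (limb)t; k = t >> 32;
    t = (uint64_t)rp[n + 2] + c2 + k; rp[n + 2] = (limb)t; k = t >> 32;
    return (limb)(c3 + k);  /* cannot overflow: result fits in n+4 limbs */
}
```

A real addmul_4 would keep the four products independent rather than
threading k through them one by one, which is where the vmlal/vaddl
scheduling games come in.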
I made some new experiments using this code and variants of it, filename
m.s:
	.text
	.align	2
	.global	main
	.type	main, %function
main:
	mov	r0, #0x3b800000
1:	subs	r0, r0, #1
	vmlal.u32	q0, d31, d31
	vaddl.u32	q8, d31, d31
	vmlal.u32	q1, d31, d31
	vaddl.u32	q9, d31, d31
	vmlal.u32	q2, d31, d31
	vaddl.u32	q10, d31, d31
	vmlal.u32	q3, d31, d31
	vaddl.u32	q11, d31, d31
	bne	1b
	bx	lr
	.size	main, .-main
Command:
t=`rsync m.s parma: && ssh parma "gcc -marm -march=armv7-a -mfpu=neon m.s && time -p ./a.out" 2>&1 | grep user | awk '{print $2}'`; gexpr $t*1.7/8
The *1.7 compensates for the roughly 1e9 iterations and parma's 1.7e9
clock. The /8 is because the loop is 4-way unrolled with 2-way SIMD.
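Spelled out, the arithmetic in that gexpr expression is just this (the
loop count is the value loaded into r0 above):

```c
/* The arithmetic behind the gexpr line: cycles per 32x32->64 multiply,
   given a `time -p` user-seconds reading.  0x3b800000 is the loop count
   loaded into r0 above; 1.7e9 is parma's clock; 8 because the loop is
   4-way unrolled with 2-way SIMD. */
static double cycles_per_mult(double user_seconds)
{
    return user_seconds * 1.7e9 / (double)0x3b800000 / 8.0;
}
```

So a (made-up, for illustration) user time of 4.7 s would give
cycles_per_mult(4.7) of about 1.0.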
Without the vaddl.u32, the mul throughput is one vmlal per cycle, i.e.
two 32 x 32 -> 64 multiplies per cycle, on A15. The A9 gets exactly half
of that (after first replacing the *1.7 by *1.0, since my A9 has a 1 GHz
clock).
With the vaddl.u32 (as in the example above) performance drops somewhat,
but not too badly.
*If* loads and stores don't disturb the pipeline, and once latencies are
properly scheduled, we should be able to get close to these c/l numbers:
A9 1.5
A15 0.7
These numbers are absolutely awesome, of course.
The assumption that loads and stores don't disturb the pipeline might
not be too silly, since addmul_4 actually performs very few memory
operations per multiply. And addmul_5, 6, 7, and 8 are even better. :-)
--
Torbjörn