ARM Neon multiplication
Torbjorn Granlund
tg at gmplib.org
Wed Apr 10 12:49:19 CEST 2013
Richard Henderson earlier wrote an addmul_8 which runs at an impressive
1.6 c/l (or something thereabout).
Since accumulating in Neon is somewhat tricky, I decided to try
alternatives. I now have a mul_1 loop which performs multiplies using
Neon insns and then adds the results using plain old adcs. The loop is
very deeply pipelined, surely too much so. Starting with a deep
pipeline helps with estimating performance headroom.
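
To make the multiply-then-add split concrete, here is a rough C model of what such a mul_1 loop computes. It is only a sketch under assumptions: the name model_mul_1 is hypothetical, limbs are taken to be 32 bits, and nothing here is GMP source or reflects the software pipelining of the real loop. The multiply stage stands in for one lane of vmull.u32 and the add stage for one adcs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not GMP code:
   {rp, n} = {up, n} * v, returning the most significant limb.
   Limbs are assumed to be 32 bits wide. */
static uint32_t model_mul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n >= 1);
    uint32_t prev_hi = 0;  /* high half of the previous product */
    uint32_t carry = 0;    /* the carry flag threaded through the adcs chain */

    for (size_t i = 0; i < n; i++) {
        /* Multiply stage: what one lane of vmull.u32 produces. */
        uint64_t prod = (uint64_t)up[i] * v;
        uint32_t lo = (uint32_t)prod;
        uint32_t hi = (uint32_t)(prod >> 32);

        /* Add stage: one adcs, low half + previous high half + carry. */
        uint64_t sum = (uint64_t)lo + prev_hi + carry;
        rp[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
        prev_hi = hi;
    }

    /* The full product fits in n+1 limbs, so this sum cannot wrap. */
    return prev_hi + carry;
}
```

In this model each result limb costs one product and one add-with-carry; the Neon loop does the same work but keeps several products in flight at once.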
On A15, the loop runs at about 1.83 c/l. This should be compared to
2.25 for the repo code. Not a bad improvement, but perhaps it
translates to a real improvement only if we can make the pipeline
shallower while maintaining the 1.83 c/l.
If an addmul_1 can be made to run at 1.83 c/l, that would be more
exciting, even with a deep pipeline. We have a vmlal, which could help.
That would require loading from the rp[] vector to a 128-bit register to
create a state like {0,rp[i+1],0,rp[i]}. I am not sure how to do that
efficiently.
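
The arithmetic behind the {0,rp[i+1],0,rp[i]} idea can be sketched in C as below. This is a model under assumptions, not a proposed implementation: model_addmul_1 is a hypothetical name, limbs are 32 bits, and each 64-bit variable stands for one 64-bit lane of the 128-bit register. A widening move such as vmovl.u32 would produce this zero-extended layout from a 64-bit load, but whether that is cheap enough on A15 is not something the sketch can tell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not GMP code:
   {rp, n} += {up, n} * v, returning the carry-out limb.
   Limbs are assumed to be 32 bits wide. */
static uint32_t model_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n >= 1);
    uint32_t prev_hi = 0;  /* high half of the previous accumulated lane */
    uint32_t carry = 0;    /* the carry flag threaded through the adcs chain */

    for (size_t i = 0; i < n; i++) {
        /* Zero-extended rp[i]: one 64-bit lane of {0,rp[i+1],0,rp[i]}. */
        uint64_t acc = (uint64_t)rp[i];

        /* One lane of vmlal.u32: acc += up[i] * v. This cannot wrap,
           since (2^32-1)^2 + (2^32-1) < 2^64. */
        acc += (uint64_t)up[i] * v;

        /* Same add-with-carry chain as in a plain mul_1. */
        uint64_t sum = (uint64_t)(uint32_t)acc + prev_hi + carry;
        rp[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
        prev_hi = (uint32_t)(acc >> 32);
    }

    /* The full result fits in n+1 limbs, so this sum cannot wrap. */
    return prev_hi + carry;
}
```

The point of the layout is that the rp[] addend is absorbed by the multiply-accumulate, so the scalar adcs chain stays the same length as for mul_1.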
Here is the new mul_1 loop:
L(top): adcs      r12, r5, r6
        vmov      r4, r5, d0              C 0 1
        str       r12, [rp], #4
        adcs      r12, r7, r8
        vmov      r6, r7, d1              C 1 2
        str       r12, [rp], #4
        vmull.u32 q0, d16, v
        adcs      r12, r9, r10
        vmov      r8, r9, d2              C 2 3
        str       r12, [rp], #4
        adcs      r12, r11, r4
        vmov      r10, r11, d3            C 3 4
        str       r12, [rp], #4
        vmull.u32 q1, d17, v
        vld1.u32  {d16,d17}, [up]!
        adcs      r12, r5, r6
        vmov      r4, r5, d4              C 4 5
        str       r12, [rp], #4
        adcs      r12, r7, r8
        vmov      r6, r7, d5              C 5 6
        str       r12, [rp], #4
        vmull.u32 q2, d18, v
        adcs      r12, r9, r10
        vmov      r8, r9, d6              C 6 7
        str       r12, [rp], #4
        adcs      r12, r4, r11
        vmov      r10, r11, d7            C 7 8
        str       r12, [rp], #4
        vmull.u32 q3, d19, v
        vld1.u32  {d18,d19}, [up]!
        sub       n, n, #8
        tst       n, n
        bpl       L(top)
--
Torbjörn