Neon addmul_8
Richard Henderson
rth at twiddle.net
Sun Feb 24 23:09:00 CET 2013
On 2013-02-24 10:39, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
>> Yeah, but I use up the entire multiplication portion doing the integer
>> bookkeeping for the round. I want to get that done asap so that the
>> load instructions for the next round are issued as early as
>> possible. And as far as I know, no ARM pipeline does more than dual
>> issue.
>
> Isn't A15 3-issue?
Well, I'm working off what looks like a slide-deck presentation rather
than proper documentation, but you're right it can do 3.
>>> The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.
>>
>> I see. And any amount of adding under the toom limit is reasonable?
>
> Sorry, I don't understand.
I mean that one shouldn't optimize for anything beyond the Toom22
cutoff, which is currently set to 31 for cortexa15.
> An addmul_14? Zany. :-)
That just happens to be the limit of what can be held in the 24
call-clobbered 64-bit registers. ;-) This follows from the perhaps
naive observation that the more we hold in registers, the less we have
to keep re-reading from memory.
> Let's start by identifying some building blocks.
>
> IL = Inner loop
> FI = Feed-in code for IL
> WD = Wind-down code for IL
...
> For mul_basecase we can overlap the previous iteration WD with the new
> iteration FI, which will save many, many cycles.
>
> |\
> | \
> | \
> |FI \
> |____\
> | |
> | |
> | IL |
> | |
> |_____|
> |\ |
> | \WD |
> | \ |
> |FI \ |
> |____\|
> | |
> | |
> | IL |
> | |
> |_____|
> |\ |
> | \WD |
> | \ |
> |FI \ |
> |____\|
Clever. The FI portion is, with the current code, very "flat", as we
can load 8 limbs with a single instruction. But the WD section is
fairly long, and overlapping the two could hide a lot of L2 load latency.
> Let's consider vertical summation for a while!
>
> If we form product sums like
>
> S_i = u_{k-1} * v_{0} + u_{k-2} * v_{1} + ... + u_{0} * v_{k-1}
>
> akin to a convolution, we will (as long as k is kept < 2^b for a b-bit
> word size) form a 3-word sum S_i.
The tricky part, as I see it, is the "3-word" bit. That's easy in the
integer unit with adcs, but much harder in the NEON unit.
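The 3-word bound itself is easy to sanity-check; here is a plain Python
sketch standing in for the limb arithmetic (b = 32 chosen for
illustration, not from the mail):

```python
# Check that a column sum S_i of k partial products fits in three
# b-bit words whenever k < 2^b.  Worst case: every limb is 2^b - 1.
b = 32
mask = (1 << b) - 1          # largest b-bit limb

k = 31                       # e.g. just under the Toom22 cutoff
s = sum(mask * mask for _ in range(k))

# Each product is < 2^(2b); fewer than 2^b of them sum to < 2^(3b),
# so S_i always fits in three words.
assert s < (1 << (3 * b))
assert s.bit_length() <= 3 * b
```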
I do wonder about using the VTRN instruction to swap some of the
elements around though...
@ d0 = { u1 u0 }, d1 = { v1 v0 }
vtrn.32 d0, d0 @ d0 = { u0 u1 }
vmlal.u32 q2, d0, d1 @ q2 = { v1*u0, v0*u1 }
which does get us data that belongs in the same result column.
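In plain terms, the swap-then-multiply-accumulate step can be modeled
like this (a Python sketch, not NEON; lane 0 listed first, sample limb
values invented for illustration):

```python
# Model d0 = [u0, u1] and d1 = [v0, v1] as two-lane 32-bit vectors.
u0, u1 = 3, 5
v0, v1 = 7, 11

d0 = [u0, u1]
d1 = [v0, v1]

# vtrn.32 d0, d0 swaps the two 32-bit lanes of d0.
d0 = [d0[1], d0[0]]                  # now [u1, u0]

# vmlal.u32 widens and multiplies lane-wise into 64-bit accumulators
# (accumulator assumed zero here).
q2 = [d0[0] * d1[0], d0[1] * d1[1]]  # [u1*v0, u0*v1]

# Both lanes now hold partial products belonging to the same result
# column of the convolution.
assert q2[0] + q2[1] == u1 * v0 + u0 * v1
```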
But we can't just use the obvious VPADD without losing the carry. This
is where an add-then-right-shift aka carry-out insn is really missing.
One possible way is
@ q0 = { u0*v1, u1*v0 }
@ = { a b c d }
@ q1 = { u0*v3, u1*v2 }
@ = { w x y z }
vuzp.32 q0, q1
@ q0 = { x z b d }
@    = d1 = { lo(u0*v3), lo(u1*v2) }
@      d0 = { lo(u0*v1), lo(u1*v0) }
@ q1 = { w y a c }
@    = d3 = { hi(u0*v3), hi(u1*v2) }
@      d2 = { hi(u0*v1), hi(u1*v0) }
vpaddl.u32 q0, q0
vpaddl.u32 q1, q1
@ q0 = { lo(u0*v3) + lo(u1*v2), lo(u0*v1) + lo(u1*v0) }
@ q1 = { hi(u0*v3) + hi(u1*v2), hi(u0*v1) + hi(u1*v0) }
Now the elements of these vectors are at most 33 bits, but we've yet to
actually carry out. We can either use VSRA to add the carry
immediately, VSHRN to store the carry away in a D reg, or more VUZP or
VEXT trickery.
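The VUZP/VPADDL sequence above can be traced in Python to confirm the
33-bit bound (a model only; 32-bit lanes, lists hold lane 0 first,
worst-case all-ones limbs assumed):

```python
# Worst-case limbs: every 32-bit input is 2^32 - 1.
u0 = u1 = v0 = v1 = v2 = v3 = 0xffffffff

lo = lambda x: x & 0xffffffff
hi = lambda x: x >> 32

# q0 = { u0*v1, u1*v0 }, q1 = { u0*v3, u1*v2 } as 4x32-bit lanes.
q0 = [lo(u1 * v0), hi(u1 * v0), lo(u0 * v1), hi(u0 * v1)]
q1 = [lo(u1 * v2), hi(u1 * v2), lo(u0 * v3), hi(u0 * v3)]

# vuzp.32 q0, q1: even-indexed lanes of the pair go to q0, odd to q1,
# separating the low halves from the high halves.
both = q0 + q1
q0, q1 = both[0::2], both[1::2]

# vpaddl.u32: pairwise-add adjacent 32-bit lanes into 64-bit lanes,
# producing the column sums from the mail.
paddl = lambda q: [q[0] + q[1], q[2] + q[3]]
q0, q1 = paddl(q0), paddl(q1)

# Each lane is the sum of two 32-bit values, hence at most 33 bits;
# the carry has not been propagated yet.
assert all(x.bit_length() <= 33 for x in q0 + q1)
```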
But all this suggests another way to look at e.g. addmul_4...
r~