Neon addmul_8

Richard Henderson rth at twiddle.net
Sun Feb 24 23:09:00 CET 2013


On 2013-02-24 10:39, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
>    Yeah, but I use up the entire multiplication portion doing the integer
>    bookkeeping for the round.  I want to get that done asap so that the
>    load instructions for the next round are issued as early as
>    possible. And as far as I know, no ARM pipeline does more than dual
>    issue.
>
> Isn't A15 3-issue?

Well, I'm working off what looks like a slide-deck presentation rather 
than proper documentation, but you're right, it can do 3.

>    > The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.
>
>    I see.  And any amount of adding under the toom limit is reasonable?
>
> Sorry, I don't understand.

I mean that one shouldn't optimize for anything beyond the Toom22 
cutoff, which is currently set to 31 for cortexa15.

> An addmul_14?  Zany.  :-)

That just happens to be the limit of what can be held in the 24 
call-clobbered 64-bit registers.  ;-)  It's based on the perhaps naive 
observation that the more we hold in registers, the less we have to 
keep re-reading from memory.

> Let's start by identifying some building blocks.
>
> IL = Inner loop
> FI = Feed-in code for IL
> WD = Wind-down code for IL
...
> For mul_basecase we can overlap the previous iteration WD with the new
> iteration FI, which will save many, many cycles.
>
> |\
> | \
> |  \
> |FI \
> |____\
> |     |
> |     |
> | IL  |
> |     |
> |_____|
> |\    |
> | \WD |
> |  \  |
> |FI \ |
> |____\|
> |     |
> |     |
> | IL  |
> |     |
> |_____|
> |\    |
> | \WD |
> |  \  |
> |FI \ |
> |____\|

Clever.  The FI portion is, with the current code, very "flat", as we 
can load 8 limbs with a single instruction.  But the WD section is 
fairly long, and overlapping the two could hide a lot of L2 load latency.
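
As a point of reference, here is a rough C sketch of how mul_basecase 
drives the addmul family (not GMP's actual code, which dispatches to 
addmul_2/_4/etc. where available and special-cases small vn; it assumes 
the usual mpn types and the conventional rp/up/un/vp/vn arguments):

	mp_limb_t cy;
	mp_size_t i;

	cy = mpn_mul_1 (rp, up, un, vp[0]);	/* first row: no add */
	rp[un] = cy;
	for (i = 1; i < vn; i++)
	  {
	    cy = mpn_addmul_1 (rp + i, up, un, vp[i]);	/* FI + IL + WD */
	    rp[un + i] = cy;
	  }

Every row after the first runs the full FI/IL/WD sequence, so the 
overlap pays off once per (chunk of) limbs of vp.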

> Let's consider vertical summation for a while!
>
> If we form product sums like
>
> S_i = u_{k-1} * v_{0} + u_{k-2} * v_{1} + ... + u_{0} * v_{k-1}
>
> akin to a convolution, we will (as long as k is kept < 2^b for a b-bit
> word size) form a 3-word sum S_i.

The tricky part, as I see it, is the "3-word" bit.  That's easy in the 
integer unit with ADCS, but much harder in the NEON unit.
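
To make the 3-word bookkeeping concrete, here is a minimal C sketch 
(illustration only; it assumes <stdint.h>, 32-bit limb arrays u[] and 
v[], and k < 2^32) of one column sum accumulated into three words:

	uint32_t s0 = 0, s1 = 0, s2 = 0;	/* S_i = s2:s1:s0 */
	for (int j = 0; j < k; j++)
	  {
	    uint64_t p = (uint64_t) u[k - 1 - j] * v[j];
	    uint64_t t = (uint64_t) s0 + (uint32_t) p;
	    s0 = (uint32_t) t;
	    t = (uint64_t) s1 + (p >> 32) + (t >> 32);
	    s1 = (uint32_t) t;
	    s2 += (uint32_t) (t >> 32);
	  }

Note the two carry propagations per product; that's the chain ADCS 
collapses into single instructions.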

I do wonder about using the VTRN instruction (or VREV64, since VTRN 
wants two distinct registers) to swap some of the elements around 
though...

				@ d0 = { u1 u0 }, d1 = { v1 v0 }
	vrev64.32 d0, d0	@ d0 = { u0 u1 }
	vmlal.u32 q2, d0, d1	@ q2 += { v1*u0, v0*u1 }

which does get us data that belongs in the same result column.

But we can't just use the obvious VPADD without losing the carry.  This 
is where an add-then-right-shift aka carry-out insn is really missing.

One possible way is

	@ q0 = { u0*v1, u1*v0 }
	@    = { a b c d }
	@ q1 = { u0*v3, u1*v2 }
	@    = { w x y z }

	vuzp.32	q0, q1

	@ q0 = { x z b d }
	@    = d0 = { lo(u0*v3), lo(u1*v2) }
	@      d1 = { lo(u0*v1), lo(u1*v0) }
	@ q1 = { w y a c }
	@    = d2 = { hi(u0*v3), hi(u1*v2) }
	@      d3 = { hi(u0*v1), hi(u1*v0) }

	vpaddl.u32	q0, q0
	vpaddl.u32	q1, q1

	@ q0 = { lo(u0*v3) + lo(u1*v2), lo(u0*v1) + lo(u1*v0) }
	@ q1 = { hi(u0*v3) + hi(u1*v2), hi(u0*v1) + hi(u1*v0) }

Now the elements of these vectors are at most 33 bits, but we've yet to 
actually carry out.  We can either use VSRA to add the carry 
immediately, VSHRN to store the carry away in a D reg, or more VUZP or 
VEXT trickery.
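
The VSRA variant, for instance, might look like the following 
intrinsics sketch (hypothetical helper, not from any existing code: lo 
holds the two low-column sums and hi the two high-column sums from 
above, each at most 33 bits per 64-bit lane):

	#include <arm_neon.h>

	/* Fold the carry bits (>= 32) of each low-column sum into the
	   high-column sum one column up, then narrow the low columns
	   back to one 32-bit limb per lane.  */
	static inline uint32x2_t
	fold_carry (uint64x2_t lo, uint64x2_t *hi)
	{
	  *hi = vsraq_n_u64 (*hi, lo, 32);	/* vsra.u64: hi += lo >> 32 */
	  return vmovn_u64 (lo);		/* vmovn.i64: low 32 bits */
	}

This lines up because the carry out of each low column belongs in 
exactly the column the corresponding high-sum lane holds; the VSHRN 
option would instead use vshrn_n_u64 (lo, 32) to park the carries in a 
D register.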

But all this suggests another way to look at e.g. addmul_4...


r~

