arm "neon"

Sat Feb 23 21:05:08 CET 2013

On 2013-02-23 05:31, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
>    Down to 5.8 cyc/limb.  Good, but not fantastic.  I'm gonna try one more time
>    with larger unrolling to make full use of the vector load insns, and less
>    over-prefetching.
>
> Good improvement!
>
> Keep in mind that addmul_ will be used for smallish count almost
> always.  Prefetching will not help for typical use.
>
> The reason for that is that we use Karatsuba's algorithm for counts of
> over about 20.

Good to know.  I won't base my tuning on tests/devel/addmul_N's default 
of ~600 limbs then.

> Some comments about the addmul approaches:
>
> The name of the game for addmul_1 or addmul_2 is keeping the recurrency
> path shallow.  When using mulacc insns, we should add in the rp[] data
> there, since that puts the mulacc off of the recurrency path.

I was adding in the rp data during the carry folding stage.  But since 
it added zero extra instructions to the carry folding stage, I figured 
that was as good as anything.
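
For anyone following along, here is a minimal portable C sketch of the addmul_1 recurrence being discussed. It is not the Neon code from these trials. The 32-bit limb size and the name ref_addmul_1 are assumptions made for illustration. The point is only to show that the carry is the loop-carried dependency, and that the rp[] limb can be folded into the same 64-bit accumulation as the product without overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not GMP's mpn_addmul_1 itself):
   {rp,n} += {up,n} * v, returning the carry-out limb.
   Assumes 32-bit limbs, as on 32-bit ARM. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n,
                             uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Product plus rp[i] plus carry cannot overflow 64 bits:
           (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;          /* low half is the result limb */
        carry = (uint32_t)(t >> 32);  /* high half feeds the next limb */
    }
    return carry;
}
```

Only the carry links one iteration to the next here. The multiply and the rp[i] load are independent of the previous iteration, which is why it matters where those additions are placed relative to the carry chain.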

> For addmul_(k) for large enough k every k (or perhaps every k/2 with
> 2-way SIMD) mulacc insn will be dependent.  So here, mulacc will be on a
> recurrency path, but there will be k (or k/2) parallel recurrency paths.
>
> I believe the Neon approach will only be a real win for A15, where we
> should be able to approach 0.7 c/l.  Not for addmul_2, but for
> addmul_(k), for some non-crazy k.

I'm going to try again with addmul_4 instead.  Getting the same level of 
parallelism from addmul_2 requires a main loop processing 16 limbs.  At 
which point, given your Karatsuba comment, we'll likely only traverse 
the loop once.

The addmul_4 will require a minimum size=8 to enter the main 4x4 loop, 
with a secondary 4x1 loop to clean up the last (up to) 7 limbs.

I suspect that an addmul_k, k >= 6 might need only a Kx1 main loop to 
keep the pipes flowing.
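
To make the loop shapes concrete, here is a rough scalar C sketch of a main "4x4" loop plus a "4x1" cleanup loop. It is an assumption-laden illustration, not the Neon implementation: limbs are taken to be 32 bits, the names step_4x1 and sketch_addmul_4 are made up, the calling convention (rp has n+4 limbs, carry-out returned) is chosen for simplicity and may differ from GMP's, and the main loop is simply four 4x1 steps with no software pipelining. So this version enters the main loop at n >= 4 and leaves at most 3 limbs for cleanup, rather than the 8 and 7 mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* One "4x1" step: multiply one limb u by the four limbs of vp, add the
   four pending carry limbs c[] and the limb *rp_i, store the low limb
   and shift the remaining four limbs back into c[]. */
static void step_4x1(uint32_t *rp_i, uint32_t u, const uint32_t vp[4],
                     uint32_t c[4])
{
    /* Each sum is at most (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
    uint64_t t = (uint64_t)u * vp[0] + c[0] + *rp_i;
    *rp_i = (uint32_t)t;
    for (int j = 1; j < 4; j++) {
        t = (uint64_t)u * vp[j] + c[j] + (t >> 32);
        c[j - 1] = (uint32_t)t;
    }
    c[3] = (uint32_t)(t >> 32);
}

/* Hypothetical interface: {rp,n+4} += {up,n} * {vp,4}.
   Returns the carry out of the top limb rp[n+3]. */
static uint32_t sketch_addmul_4(uint32_t *rp, const uint32_t *up, size_t n,
                                const uint32_t vp[4])
{
    uint32_t c[4] = {0, 0, 0, 0};
    size_t i = 0;

    /* Main loop: 4 limbs of up[] per iteration, i.e. 16 products. */
    for (; i + 4 <= n; i += 4) {
        step_4x1(&rp[i + 0], up[i + 0], vp, c);
        step_4x1(&rp[i + 1], up[i + 1], vp, c);
        step_4x1(&rp[i + 2], up[i + 2], vp, c);
        step_4x1(&rp[i + 3], up[i + 3], vp, c);
    }
    /* Cleanup loop: one limb of up[] at a time. */
    for (; i < n; i++)
        step_4x1(&rp[i], up[i], vp, c);

    /* Fold the four pending carry limbs into the top of rp[]. */
    uint64_t t = 0;
    for (int j = 0; j < 4; j++) {
        t = (uint64_t)rp[n + j] + c[j] + (t >> 32);
        rp[n + j] = (uint32_t)t;
    }
    return (uint32_t)(t >> 32);
}
```

Within one step the four products are independent of each other, so in a vector version they are the natural candidates for parallel multiply-accumulates. Only the c[] limbs carry state from step to step.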

> Interestingly enough, I suspect such code might also be reusable for
> A5x, since they have poor 64 x 64 -> 128 integer multiply support.  If
> my information about the high-end A57 is correct, one such product can
> be formed every 7th cycle.  If the Neon units are the same as in A15,
> Neon code would be twice as fast.

Interesting.  I've only got access to an A15, so that's pretty much all 
I care about atm.  But these trials are on the list, if someone else 
wants to do speed testing.

r~