arm "neon"
Richard Henderson
rth at twiddle.net
Sat Feb 23 21:05:08 CET 2013
On 2013-02-23 05:31, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
>
> Down to 5.8 cyc/limb. Good, but not fantastic. I'm gonna try one more time
> with larger unrolling to make full use of the vector load insns, and less
> over-prefetching.
>
> Good improvement!
>
> Keep in mind that addmul_ will be used for smallish count almost
> always. Prefetching will not help for typical use.
>
> The reason for that is that we use Karatsuba's algorithm for counts of
> over about 20.
Good to know. I won't base my tuning on tests/devel/addmul_N's default
of ~600 limbs then.
> Some comments about the addmul approaches:
>
> The name of the game for addmul_1 or addmul_2 is keeping the recurrence
> path shallow. When using mulacc insns, we should add in the rp[] data
> there, since that puts the mulacc off of the recurrence path.
I was adding in the rp data during the carry folding stage. But since
it added zero extra instructions to the carry folding stage, I figured
that was as good as anything.
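
For reference, a minimal scalar C sketch of the addmul_1 arithmetic, with
the rp[] limb folded into the multiply-accumulate itself. It assumes 32-bit
limbs, and limb_t and ref_addmul_1 are hypothetical names, not GMP's. It
models only the arithmetic, not the Neon code or its scheduling. The 64-bit
sum cannot overflow, because (2^32-1)^2 + 2*(2^32-1) = 2^64-1.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb, standing in for mp_limb_t */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* product + rp[] limb + incoming carry, all in one 64-bit accumulate */
        uint64_t acc = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)acc;            /* low half is the result limb   */
        carry = (limb_t)(acc >> 32);    /* high half feeds the next limb */
    }
    return carry;
}
```

The only loop-carried dependency here is carry; the rp[i] addend is
available before the loop reaches it, so adding it costs nothing on the
recurrence.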
> For addmul_(k) with large enough k, every k-th (or perhaps every k/2-th
> with 2-way SIMD) mulacc insn will be dependent. So here, mulacc will be on
> a recurrence path, but there will be k (or k/2) parallel recurrence paths.
>
> I believe the Neon approach will only be a real win for A15, where we
> should be able to approach 0.7 c/l. Not for addmul_2, but for
> addmul_(k), for some non-crazy k.
I'm going to try again with addmul_4 instead. Getting the same level of
parallelism from addmul_2 requires a main loop processing 16 limbs. At
that point, given your Karatsuba comment, we'd likely traverse the loop
only once.
The addmul_4 will require a minimum size=8 to enter the main 4x4 loop,
with a secondary 4x1 loop to clean up the last (up to 7) limbs.
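
A rough scalar C sketch of that loop shape, under stated assumptions:
32-bit limbs, hypothetical names (limb_t, step_4x1, sketch_addmul_4), and
the four carry-out limbs returned in a separate array rather than following
any particular GMP calling convention. It does not model the Neon
instructions or the software pipelining that presumably leads to the
size=8 entry condition, so the cleanup loop here handles at most 3 limbs.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb, standing in for mp_limb_t */

/* One 4x1 step: fold up_i * v[0..3] + *rp + w[0..3] into *rp and the
   4-limb carry window w.  Each a*b + c + d fits in 64 bits. */
static void step_4x1(limb_t *rp, limb_t up_i, const limb_t v[4], limb_t w[4])
{
    uint64_t t = (uint64_t)up_i * v[0] + *rp + w[0];
    *rp = (limb_t)t;
    for (int j = 1; j < 4; j++) {
        t = (uint64_t)up_i * v[j] + w[j] + (t >> 32);
        w[j - 1] = (limb_t)t;          /* window shifts down by one limb */
    }
    w[3] = (limb_t)(t >> 32);
}

/* rp[0..n-1] += up[0..n-1] * v[0..3]; carry-out limbs go to hi[0..3]. */
static void sketch_addmul_4(limb_t *rp, const limb_t *up, size_t n,
                            const limb_t v[4], limb_t hi[4])
{
    limb_t w[4] = {0, 0, 0, 0};
    size_t i = 0;

    /* "4x4" main loop: four limbs of up against four limbs of v per pass. */
    for (; i + 4 <= n; i += 4) {
        step_4x1(rp + i,     up[i],     v, w);
        step_4x1(rp + i + 1, up[i + 1], v, w);
        step_4x1(rp + i + 2, up[i + 2], v, w);
        step_4x1(rp + i + 3, up[i + 3], v, w);
    }

    /* "4x1" cleanup loop: one limb of up per pass. */
    for (; i < n; i++)
        step_4x1(rp + i, up[i], v, w);

    for (int j = 0; j < 4; j++)
        hi[j] = w[j];
}
```

In this scalar form the four products in a step are chained through t; a
vector version would presumably keep them in separate accumulators and fold
carries afterwards.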
I suspect that an addmul_k, k >= 6 might need only a Kx1 main loop to
keep the pipes flowing.
> Interestingly enough, I suspect such code might also be reusable for
> A5x, since they have poor 64 x 64 -> 128 integer multiply support. If
> my information about the high-end A57 is correct, one such product can
> be formed every 7th cycle. If the Neon units are the same as in A15,
> Neon code would be twice as fast.
Interesting. I've only got access to an A15, so that's pretty much all
I care about atm. But these trials are on the list, if someone else
wants to do speed testing.
r~