arm "neon"
Torbjorn Granlund
tg at gmplib.org
Sat Feb 23 22:01:41 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> The reason for that is that we use Karatsuba's algorithm for counts of
> over about 20.
Good to know. I won't base my tuning on tests/devel/addmul_N's
default of ~600 limbs then.
Using a larger count (while staying below the L1 cache size) might still
be useful for measuring loop performance.  With smaller counts, call
overhead and loop overhead blur the measurements.
> For addmul_(k) for large enough k every k (or perhaps every k/2 with
> 2-way SIMD) mulacc insn will be dependent. So here, mulacc will be on a
> recurrency path, but there will be k (or k/2) parallel recurrency paths.
>
> I believe the Neon approach will only be a real win for A15, where we
> should be able to approach 0.7 c/l. Not for addmul_2, but for
> addmul_(k), for some non-crazy k.
I'm going to try again with addmul_4 instead.  Getting the same level
of parallelism from addmul_2 requires a main loop processing 16 limbs,
at which point, given your Karatsuba comment, we'd likely traverse the
loop only once.
If the mul_basecase cross-product speed gets close to 1 c/l, we can
expect the Karatsuba threshold to be perhaps 50 or 60.  There are a lot
of dynamics here.
With a clever inner loop, we can use the same pattern in the feed-in
and wind-down, getting good performance for the entire thing.  (And in
mul_basecase, which iterates an addmul_k building block, one may overlap
the wind-down of one block with the feed-in of the next, making it look
very much like the inner loop except that we break the dependencies.)
The addmul_4 will require a minimum size of 8 to enter the main 4x4
loop, with a secondary 4x1 loop to clean up the final (up to 7) limbs.
I suspect that an addmul_k, k >= 6, might need only a kx1 main loop to
keep the pipes flowing.
I suspect such an approach (i.e., the smallest k for which no unrolling
gives great performance) might ultimately prove best, at least for the
critically important mul_basecase.
> Interestingly enough, I suspect such code might also be reusable for
> A5x, since they have poor 64 x 64 -> 128 integer multiply support. If
> my information about the high-end A57 is correct, one such product can
> be formed every 7th cycle. If the Neon units are the same as in A15,
> Neon code would be twice as fast.
Interesting. I've only got access to an A15, so that's pretty much
all I care about atm. But these trials are on the list, if someone
else wants to do speed testing.
I've got access to an A9 too, which has a worse Neon unit.
What I've heard about the A57 is that its innards are basically those
of an A15, with the plain functional units extended to 64 bits, and of
course a new decoder and a larger register bank.  There is just one
32x32 -> 64 multiplier, with a throughput of one op per cycle.  A v8
mul insn will require 3 runs through the multiply unit, and a v8 umulh
insn will require 4 runs through it.
To summarise:
          32x32->64   32x32->64   64x64->128   64x64->128
          scalar      vector      scalar       vector
A9        1/2         1
A15       1/2         2
A57       1           2           1/7          2/7
We'll have to wait for A57 silicon to see if this is true.
--
Torbjörn