arm "neon"
Torbjorn Granlund
tg at gmplib.org
Fri Feb 22 11:32:44 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> Indeed, the last version that Niels posted doesn't pass this test.
> Oops.
> The following does pass, but if I'm to believe the arithmetic it's
> still fairly slow -- around 12cyc/sec.
12cyc/sec is a poor clock frequency. :-)
> If one is even more clever than I, one could do a 4x unroll, making
> best use of vld4. But when you do that, getting the carries right
> becomes even more tricky. But I think any correct solution will
> involve chains of vsra to shift and add up the chain.
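To make the "chains of vsra" remark concrete, here is a minimal scalar sketch in portable C of what such a chain does per lane: each 64-bit column sum has its high part shifted right and accumulated into the next column. The names sra_u64 and normalize_columns are hypothetical, no NEON intrinsics are used, and the sketch assumes every column has enough headroom that adding the incoming carry cannot overflow 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one "shift right and accumulate" lane operation:
   acc += x >> shift.  Restricted to shift < 64 to stay within defined
   C behaviour. */
static uint64_t sra_u64(uint64_t acc, uint64_t x, unsigned shift)
{
    assert(shift >= 1 && shift < 64);
    return acc + (x >> shift);
}

/* Hypothetical helper: col[] holds n unnormalized 64-bit column sums,
   each the sum of a few 32-bit quantities.  Fold the high part of each
   column into the next one, store the low 32 bits as result limbs, and
   return the carry out of the top column.  col[] is modified. */
static uint64_t normalize_columns(uint32_t *limbs, uint64_t *col, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        col[i + 1] = sra_u64(col[i + 1], col[i], 32); /* carry into next column */
        limbs[i] = (uint32_t) col[i];                 /* low half is the limb */
    }
    limbs[n - 1] = (uint32_t) col[n - 1];
    return col[n - 1] >> 32;
}
```

Each step depends on the previous one, which is why the chain is serial and why unrolling makes the carry handling harder rather than easier.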
Perhaps addmul_2 might not be easy to make fast for this target.
I think a mul_basecase could be made to run at awesome speed. We might
need a building block of at least addmul_4, more likely something
larger.
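For reference, the building blocks discussed here all share the shape of addmul_1: {rp,n} += {up,n} * v, returning the carry-out limb. An addmul_2 or addmul_4 applies 2 or 4 multiplier limbs per pass over the same operand. Below is a plain scalar sketch assuming 32-bit limbs. The name ref_addmul_1 is hypothetical, and this is not GMP's implementation, only a model of the arithmetic that a NEON version would have to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model with 32-bit limbs: add up[0..n-1] * v into rp[0..n-1]
   and return the carry-out limb. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n,
                             uint32_t v)
{
    uint64_t cy = 0;
    assert(rp != NULL && up != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t) up[i] * v + rp[i] + cy;
        rp[i] = (uint32_t) t;  /* low half is the new limb */
        cy = t >> 32;          /* high half carries into the next limb */
    }
    return (uint32_t) cy;
}
```

The bound in the comment is the crux: one 32x32 product plus two 32-bit addends exactly fills 64 bits, so any extra addend per column needs either headroom or an additional carry step.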
Neon has SIMD 32+32 -> 64 bit add. Assume we want to do (32+32)+32 or
((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
there good ISA support for that too? It might require an insn that does
32+64 -> 64.
The key here is accumulation support, as always with SIMD.
Without such ISA support, we probably need more right shift
operations, which will damage performance.
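The two accumulation shapes asked about above can be modelled per lane in scalar C as follows. The helper names are hypothetical. NEON's VADDL (32+32 -> 64) and VADDW (64+32 -> 64) instructions appear to match add_long and add_wide lane for lane, but treat that mapping as an assumption to check against the ARM reference rather than as established here.

```c
#include <assert.h>
#include <stdint.h>

/* 32 + 32 -> 64: widen both operands, then add.  Cannot overflow. */
static uint64_t add_long(uint32_t a, uint32_t b)
{
    return (uint64_t) a + b;
}

/* 64 + 32 -> 64: accumulate a 32-bit value into a 64-bit lane. */
static uint64_t add_wide(uint64_t acc, uint32_t b)
{
    return acc + b;
}

/* ((32+32)+32)+32: a serial chain, one long add then two wide adds. */
static uint64_t sum4_chain(const uint32_t x[4])
{
    assert(x != 0);
    return add_wide(add_wide(add_long(x[0], x[1]), x[2]), x[3]);
}

/* (32+32)+(32+32): two independent long adds, then one 64-bit add. */
static uint64_t sum4_tree(const uint32_t x[4])
{
    assert(x != 0);
    return add_long(x[0], x[1]) + add_long(x[2], x[3]);
}
```

Both arrangements give the same sum. The tree form has a shorter dependency chain, while the chain form needs only the one 64-bit accumulator per lane.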
--
Torbjörn