tg at gmplib.org
Fri Feb 22 11:32:44 CET 2013
Richard Henderson <rth at twiddle.net> writes:
> Indeed, the last version that Niels posted doesn't pass this test.
> The following does pass, but if I'm to believe the arithmetic it's
> still fairly slow -- around 12cyc/sec.
12cyc/sec is a poor clock frequency. :-)
> If one is even more clever than I, one could do a 4x unroll, making
> best use of vld4. But when you do that, getting the carries right
> becomes even more tricky. But I think any correct solution will
> involve chains of vsra to shift and add up the chain.
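
The carry handling described here can be sketched in portable C. This is only a scalar model under stated assumptions, not the Neon code from the thread: `sra64` stands in for one lane of a shift-right-and-accumulate step (the shape of vsra), and `resolve_carries` is a hypothetical helper that walks a vector of 64-bit column sums and emits 32-bit limbs. It assumes each column sum leaves enough headroom for the incoming carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of "shift right and accumulate":
   acc += x >> n.  The name sra64 is illustrative only. */
static uint64_t sra64(uint64_t acc, uint64_t x, unsigned n)
{
    return acc + (x >> n);
}

/* Hypothetical helper: col[i] is a 64-bit column sum of weight 2^(32*i).
   Stores n 32-bit limbs in rp and returns the carry out of the top limb.
   Assumption: col[i] plus the incoming carry never overflows 64 bits. */
static uint64_t resolve_carries(uint32_t *rp, const uint64_t *col, size_t n)
{
    uint64_t prev = 0;  /* previous full column value, high half is the carry */
    for (size_t i = 0; i < n; i++) {
        assert(col[i] <= UINT64_MAX - (prev >> 32));  /* headroom check */
        uint64_t t = sra64(col[i], prev, 32);  /* add carry from column i-1 */
        rp[i] = (uint32_t) t;                  /* low 32 bits are the limb */
        prev = t;
    }
    return prev >> 32;
}
```

Each step depends on the result of the previous one, which is the serial chain that makes wider unrolling awkward.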
Perhaps addmul_2 might not be easy to make fast for this target.
I think a mul_basecase could be made to run at awesome speed. We might
need a building block of at least addmul_4, more likely something
Neon has a SIMD 32+32 -> 64 bit add. If we want to do (32+32)+32 or
((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
there good ISA support for that too? It might require an insn that does
32+64 -> 64.
The key here is accumulation support, as always with SIMD.
Without such ISA support, we will probably need more right-shift
operations, which will damage performance.
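
To make the shapes under discussion concrete, here is a scalar C model of the two accumulation orders. The types `u32x2` and `u64x2` and all function names are hypothetical stand-ins, not Neon intrinsics: `add_long` models a per-lane 32+32 -> 64 add, and `add_wide` models the 64+32 -> 64 step asked about above. As far as I know, Neon's VADDL and VADDW instructions have these two shapes, but treat that mapping as an assumption to check against the ISA reference.

```c
#include <stdint.h>

/* Hypothetical two-lane vector types used only for this model. */
typedef struct { uint32_t lane[2]; } u32x2;
typedef struct { uint64_t lane[2]; } u64x2;

/* Per-lane widening add: 32+32 -> 64. */
static u64x2 add_long(u32x2 a, u32x2 b)
{
    u64x2 r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = (uint64_t) a.lane[i] + b.lane[i];
    return r;
}

/* Per-lane wide accumulate: 64+32 -> 64. */
static u64x2 add_wide(u64x2 acc, u32x2 b)
{
    for (int i = 0; i < 2; i++)
        acc.lane[i] += b.lane[i];
    return acc;
}

/* ((32+32)+32)+32 as a serial chain of accumulations. */
static u64x2 sum4_chain(u32x2 a, u32x2 b, u32x2 c, u32x2 d)
{
    return add_wide(add_wide(add_long(a, b), c), d);
}

/* (32+32)+(32+32): two independent widening adds, then a 64+64 add. */
static u64x2 sum4_tree(u32x2 a, u32x2 b, u32x2 c, u32x2 d)
{
    u64x2 x = add_long(a, b), y = add_long(c, d);
    for (int i = 0; i < 2; i++)
        x.lane[i] += y.lane[i];
    return x;
}
```

Both orders give the same sums. The tree form has a shorter dependency chain, while the chain form needs only the 64+32 accumulate step. In either case no right shift is needed until the 64-bit lanes are close to overflowing.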