rth at twiddle.net
Fri Feb 22 17:45:08 CET 2013
On 02/22/2013 02:32 AM, Torbjorn Granlund wrote:
> Richard Henderson <rth at twiddle.net> writes:
> Indeed, the last version that Niels posted doesn't pass this test.
> The following does pass, but if I'm to believe the arithmetic it's
> still fairly slow -- around 12cyc/sec.
> 12cyc/sec is a poor clock frequency. :-)
> Perhaps addmul_2 might not be easy to make fast for this target.
> I think a mul_basecase could be made to run at awesome speed. We might
> need a building block of at least addmul_4, more likely something
Perhaps. At least there are 32 registers to play around with.
(FWIW, I made a mistake in register choices in the version that I
posted. It's only d8-d15 that are call-saved. d0-d7 and d16-d31 are
> Neon has SIMD 32+32 -> 64 bit add. Assume we want to do (32+32)+32 or
> ((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is
> there good ISA support for that too? It might require an insn that does
> 32+64 -> 64.
The widening add insns are:
  VADDL    32+32->64      Qd[n] = Dn[n] + Dm[n]
  VADDW    64+32->64      Qd[n] = Qn[n] + Dm[n]
  VPADDL   32+32->64      Qd[n/2] = Dn[2n] + Dn[2n+1]   ("horizontal add")
  VPADAL   32+32+64->64   Qd[n/2] += Dn[2n] + Dn[2n+1]
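As a rough illustration of the list above, here is a scalar C model of one result lane of each insn, assuming the unsigned (.u32) forms. The function names are made up for this sketch and are not NEON intrinsics; sum4_chain and sum4_pairs only show that the ((32+32)+32)+32 and (32+32)+(32+32) arrangements give the same 64-bit result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar models of a single lane; names are illustrative only. */

/* VADDL.U32: zero-extend two 32-bit lanes and add them into a 64-bit lane. */
uint64_t vaddl_u32_lane(uint32_t dn, uint32_t dm) { return (uint64_t)dn + dm; }

/* VADDW.U32: add a zero-extended 32-bit lane to a 64-bit lane (mod 2^64). */
uint64_t vaddw_u32_lane(uint64_t qn, uint32_t dm) { return qn + dm; }

/* VPADDL.U32: add two adjacent 32-bit lanes into one 64-bit lane. */
uint64_t vpaddl_u32_lane(uint32_t even, uint32_t odd) { return (uint64_t)even + odd; }

/* VPADAL.U32: as VPADDL, but accumulate into the existing 64-bit lane. */
uint64_t vpadal_u32_lane(uint64_t acc, uint32_t even, uint32_t odd) {
    return acc + even + odd;
}

/* ((a+b)+c)+d as one VADDL followed by two VADDW. */
uint64_t sum4_chain(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return vaddw_u32_lane(vaddw_u32_lane(vaddl_u32_lane(a, b), c), d);
}

/* (a+b)+(c+d) as one VPADDL followed by one VPADAL. */
uint64_t sum4_pairs(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return vpadal_u32_lane(vpaddl_u32_lane(a, b), c, d);
}

/* Small self-check: four maximal 32-bit inputs still fit easily in 64 bits. */
void widening_add_demo(void) {
    uint32_t m = 0xFFFFFFFFu;
    assert(vaddl_u32_lane(m, m) == 0x1FFFFFFFEull);
    assert(sum4_chain(m, m, m, m) == 0x3FFFFFFFCull);
    assert(sum4_pairs(m, m, m, m) == sum4_chain(m, m, m, m));
}
```

Under this model a 32+64 -> 64 step is just VADDW, and the paired arrangement needs two insns where the chained one needs three.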
There is a narrowing add insn which might still be interesting:
VADDHN 64+64->32 Dd[n] = (Qn[n] + Qm[n]) >> 32
Suppose you're looking to do a final sum in a vector and will
subsequently be shifting the data for addition into the next column.
You have two choices:
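  (1) full-width add, then shift-accumulate into the next column:
        vadd.i64    Qc0, Qa0, Qb0
        vsra.u64    Qc1, Qc0, #32
  (2) narrowing add, then widening add into the next column:
        vaddhn.i64  Dtmp, Qa0, Qb0
        vaddw.u32   Qd1, Qc1, Dtmp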
Such a re-ordering might make data available for input earlier. It
might also allow the data to be stored away in a single D register
rather than keeping around the double Q, easing register pressure.
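To check that the two choices feed the same value into the next column, here is a scalar C sketch of one 64-bit lane. It assumes unsigned data and an unsigned shift in the first sequence; the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Choice 1: vadd.i64 produces the full sum, then a shift-right-accumulate
   by 32 adds its high half into the next column. */
uint64_t carry_via_vadd_vsra(uint64_t a, uint64_t b, uint64_t next) {
    uint64_t sum = a + b;          /* vadd.i64, wraps mod 2^64 */
    return next + (sum >> 32);     /* vsra #32 (unsigned) */
}

/* Choice 2: vaddhn.i64 keeps only the high 32 bits of the sum, then
   vaddw.u32 widens that and adds it into the next column. */
uint64_t carry_via_vaddhn_vaddw(uint64_t a, uint64_t b, uint64_t next) {
    uint32_t tmp = (uint32_t)((a + b) >> 32);  /* vaddhn.i64 */
    return next + tmp;                         /* vaddw.u32 */
}

/* Small self-check, including a sum that wraps past 64 bits. */
void next_column_demo(void) {
    uint64_t a = 0x00000001FFFFFFFFull, b = 1, next = 5;
    assert(carry_via_vadd_vsra(a, b, next) == 7);
    assert(carry_via_vaddhn_vaddw(a, b, next) == 7);
    assert(carry_via_vadd_vsra(UINT64_MAX, 1, next) ==
           carry_via_vaddhn_vaddw(UINT64_MAX, 1, next));
}
```

The difference is in what stays live: the first sequence still has the full 64-bit sum in a Q register afterwards, while the second holds only the 32-bit high halves in a D register.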