Neon addmul_8

Mon Feb 25 10:56:37 CET 2013

Richard Henderson <rth at twiddle.net> writes:

>> An addmul_14?  Zany.  :-)
>
> That just happens to be the limit of what can be held in the 24
> call-clobbered 64-bit registers.  ;-)

Are you sure you really need to go that far? I haven't read all your
loops very carefully, but as far as I understood, one iteration of your
addmul_4 loop processes 4 limbs from u (and then the 4 invariant limbs
of v), and similarly, the addmul_8 loop processes 8 limbs from u per
iteration. But the k in admul_k and the number of u limbs processed in a
single iterations doesn't have to be tied together, and there are
different reasons why they may need to be increased.

The way I understand these things (based on Torbjörn's teaching me ;-) ),
the bottleneck of addmul_1 is usually the latency of the operatons
involving the carry, the "critical path" starting with the first
instruction which depends on the carry from the previous iteration, and
ending with the final iteration producing the carry input for the next
iteration.

By doing addmul_k for k > 1, you make the recurrency more shallow, you
can organize it as one chain involving v_0 * u_k and corresponding
carry, another involving v_1 * u_k, and so on, with only an *acyclic*
dependence between the chains. So for addmul_2, the latency of the
critical path need not be much longer than for addmul_1, but you do
twice the amount of multiplication work per iteration.

And for some not insanely large k, the latency of the critical path is
no longer the most narrow bottleneck, instead performance is limited by
multiplication throughput or maybe instruction decoding resources. If
you can arrange to have the input carry limbs added in late in the
addition chain, then you can get by with a smaller k, and conversely, if
you add in the carry limbs early (with vmlal), you may need a higher k.

And once you have large enough k, you may want to unroll the loop a few
times, processing more than one limb from u, to reduce loop overhead,
but I suspect 8-way unrolling of addmul_8 is overkill. An iteration
where you multiply each of the eight v_k by the same limb u_0, and
propagate the various carry limbs, may well have sufficient independence
between operations, and small enough loop overhead, that you don't need
any further unrolling.

> Under the perhaps naive observation that the more we hold in registers
> the less we have to keep re-reading from memory.

As far as I understand, the amount of loads and stores for addmul_k and
mul_basecase should be independent of the amount of unrolling (i.e., the
number of u limbs processed per iteration). But I guess it may still be
advantageous to read and write in larger units than 32 bits.

For other (non-neon) ARM loops, Torbjörn has suggested using 3-way
unrolling, because ldm and stm can apparently do 3 32-bit words in 2
cycles regardless of alignment of the pointers (if one is willing to
arrange that pointers are 64-bit aligned, one can do 2 32-bit words in a
single cycle). But my gut feeling is that load and store bandwidth will
not be a bottleneck for addmul_k.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.