arm "neon"

Mon Jan 14 14:24:34 CET 2013

The corresponding code sustains one vmull.u32 per cycle on A15.  That's
4 times the multiply bandwidth of the A15's umul implementation.

It is usually tricky to make use of SIMD operations for addmul_(k) and
friends.  The well-designed ARM instructions will surely make it easier,
but it might still require many instructions for shuffling intermediates
around.
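
To make the shuffling problem concrete, here is a minimal scalar sketch
of what addmul_1 has to compute, assuming 32-bit limbs stored least
significant first.  The name ref_addmul_1 is made up for illustration;
this is not GMP's code.  Each 64-bit product straddles two result limbs
and the carry chain is serial, which is roughly why independent SIMD
lanes need their intermediates realigned before they can be summed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not GMP's implementation):
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb.
   Limbs are assumed to be 32 bits, least significant limb first. */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 32x32->64 product plus two 32-bit addends is at most
           2^64 - 1, so the sum cannot overflow 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;          /* low half stays in this limb */
        carry = (uint32_t)(t >> 32);  /* high half moves to the next limb */
    }
    return carry;
}
```

A vmull.u32 would produce two such 64-bit products at once, but the low
and high halves still have to be steered into neighbouring limbs as in
the loop above.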

(Did you notice that VMUL also allows polynomial multiplication over
GF(2), the building block for GF(2^n) arithmetic?  That should come in
handy for Nettle.)
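
For reference, a portable sketch of the per-lane operation, assuming the
polynomial (carry-less) 8-bit variant is the one meant.  The function
names and the choice of the AES reduction polynomial 0x11B are
assumptions made for illustration.  The instruction would supply only
the carry-less product; the reduction that turns it into a GF(2^8)
multiplication remains a separate step.

```c
#include <stdint.h>

/* Carry-less product of two 8-bit polynomials over GF(2).
   Partial products are combined with XOR, so no carries propagate.
   The full result has degree at most 14 and fits in 16 bits. */
uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1)
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* Multiplication in GF(2^8), reducing the carry-less product modulo
   x^8 + x^4 + x^3 + x + 1 (0x11B).  The polynomial is an illustrative
   choice; other fields use other moduli. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint16_t r = clmul8(a, b);
    /* Clear bits 14..8 from the top down by XORing in shifted moduli. */
    for (int i = 15; i >= 8; i--)
        if ((r >> i) & 1)
            r ^= (uint16_t)(0x11Bu << (i - 8));
    return (uint8_t)r;
}
```

For example, clmul8(3, 3) squares x + 1 and gives x^2 + 1, that is 5,
where ordinary integer multiplication would give 9.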

-- 
Torbjörn