Neon addmul_8

Sun Feb 24 13:24:10 CET 2013

Richard Henderson <rth at twiddle.net> writes:

  gcc -O2 -g3 [...] addmul_N.c -DN=8 -DCLOCK=1694000000
  $ ./t.out
  mpn_addmul_8:     2845ms (1.782 cycles/limb) [973.59 Gb/s]
  mpn_addmul_8:     2620ms (1.641 cycles/limb) [1057.20 Gb/s]
  mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]
  mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]
  mpn_addmul_8:     2625ms (1.644 cycles/limb) [1055.19 Gb/s]

Wow!  You did it!

  (Clearly there's something wrong with the bandwith calculation.  ;-)

The "bandwidth" is the number of one-bit multiplies performed.  It makes
m-bit addmul and and n-bit addmul (m=32 adn n=64 for example)
comparable.

  It's somewhat unfortunate that neon load/store insns are not
  conditional.  I've made do with conditionally loading limbs into core
  registers and copying them over.

  It might be worth a try having two copies of this main loop, one in
  which there are more than 8 limbs remaining (requiring loads of the
  9th RP limb), and one for the final 8 limbs.  But I don't have time
  today, and my suspicion is that you'll get no better than 1.60
  cyc/limb, if indeed one can measure any improvement at all.

  Any real improvement on this routine will have to be algorithmic,
  using a different mechanism to compute and apply carries.  And that's
  where it's likely someone else that's going to have to be clever.

I might not be clever enough, but I have played with these things enough
to sometimes find new tricks.  But I determinedly avoid getting dragged
into any details in this period, since I have a dissertation to write.
If I get started, I won't stop.  :-)

I have not read your various addmul_k carefully enough to follow your
adding schemes, with the dissertation excuse.  I think I understand the
addmul_8 code, though.  It seems to work precisely as my itanic addmul_8
code (not complete, not released).

Your code looks simple enough to make any improvements hard.

It is a shame those vext insns are needed.  With all those neon addition
instruction variants, I would have thought vext + vpaddl could be done
using one insn...

One possible weakness of your code is that it separates multiplication
and addition.  Many pipelines prefer a more balanced load.

I realise that changing that might be easier said than done.  Perhaps
the last vpaddl could be folded into the beginning of the loop, and
perhaps the first vext could be foled into the last vmlals?

  What sizes are really interesting for addmul decomposition?

Do you mean k, as in addmul_k?  The smaller k we can find, the better.
Any addmul_k for k > 1 is a compromise, sort of.

But in most cases, increasing k is necessary for speed.

Currently, gmp/mpn/generic/mul_basecase.c supports up to k = 6.  It
should be obvious how to generailise it, though.

  Would it be useful do a prime like addmul_7?  It would run at pretty
  much the same speed as addmul_8, since nearly the same insns would be
  used.  Just the last vector would be a D register instead of a Q
  register.

Primes don't come into it, we're talking additive properties here.

The way addmul_ is used is best seen in gmp/mpn/generic/mul_basecase.c.

There is no real reason for implementing addmul_k for specific k.  If we
optimise for common RSA sizes, k=2^t might help a bit, since the
operands will then decompose into nice pieces.

-- 
Torbjörn