ARM public key benchmark

Thu Apr 4 12:34:32 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  No speedup for addmul_1, unfortunately, but a saving for submul_1. Here
  are new versions of both files (for mpn/arm/v6).

I sometimes get better A9 performance with *discrete* pointer updates
than with the one-out-of-four autoincrement pointer updates used here.
I think the code you started with already had that one-out-of-four
trick for str?

  I wonder if this
  submul_1 complement trick is useful on some other platforms too, e.g.,
  64-bit sparc?

Possibly.  This is a trick I actually realised many years ago, so it
might very well already be used someplace in GMP.

I had, on the other hand, not realised David's one's complement +
pre-invert carry trick.  I think that trick and this trick will result
in the same loop insn count on most subtraction-challenged machines.

It is possible that this or similar tricks could be useful in other
contexts, such as the 2/1 or 3/2 quotient approximation primitives.
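
For readers without the assembly at hand, here is a rough C sketch of
one way a complement trick for submul_1 can work.  It assumes the trick
rests on the identity r - p = ~(~r + p), so that the loop accumulates
with additions only, as in addmul_1; the posted ARM code may well differ
in detail.  The names limb_t, ref_submul_1 and compl_submul_1 are
hypothetical and use 32-bit limbs; they are not GMP's actual interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb type for this sketch (not GMP's mp_limb_t). */
typedef uint32_t limb_t;

/* Plain reference: {rp,n} -= {up,n} * v, returning the borrow-out limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* up[i]*v + cy is at most B*(B-1), so it fits in 64 bits. */
        uint64_t p = (uint64_t)up[i] * v + cy;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 32);
        limb_t r = rp[i];
        rp[i] = r - lo;
        cy = hi + (r < lo);          /* fold the borrow into the carry limb */
    }
    return cy;
}

/* Complement variant: add the product to ~rp[i], then complement the low
   limb again.  Only additions appear in the carry chain, and the high
   limb of the sum equals hi + borrow of the reference version. */
static limb_t compl_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* At most (B-1)^2 + 2*(B-1) = B^2 - 1, so no 64-bit overflow. */
        uint64_t t = (uint64_t)up[i] * v + (limb_t)~rp[i] + cy;
        rp[i] = ~(limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}
```

Both functions should return the same borrow and leave the same limbs in
rp for any input; the second one just never needs a subtract-with-borrow
in the loop.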

  Running at 3.25 and 3.9 c/l on A9:

Cool!  Looks like it is actually faster than 3.9 for some
alignments/sizes.

Did you time this on some other CPU too?  I have new submul_1 code for
A15 which runs at 2.75 c/l.  It runs at 6.25 c/l on A9...  Fewer
variants are always good, so it'd be nice if your code is faster
everywhere.

To squeeze the last out of this code there are a few things you might
want to try:

1. Use discrete ptr updates for up and/or rp.
2. Move the one-out-of-four autoincrement updates to other ldr/str
   insns.
3. Use ldm/stm.  Often an A9 win.

-- 
Torbjörn