ARM public key benchmark

Torbjorn Granlund tg at
Thu Apr 4 12:34:32 CEST 2013

nisse at (Niels Möller) writes:

  No speedup for addmul_1, unfortunately, but a saving for submul_1. Here
  are new versions of both files (for mpn/arm/v6).

I sometimes get better A9 performance with *discrete* pointer updates,
not one-out-of-four autoincrement pointer updates like those used here.  I
think the code you started with had that one-out-of-four trick for str,

  I wonder if this
  submul_1 complement trick is useful on some other platforms too, e.g.,
  64-bit sparc?

Possibly.  This is a trick I actually realised many years ago, so it
might very well already be used someplace in GMP.
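
For readers who have not seen the files under discussion, here is a minimal
C sketch of one way a complement trick for submul_1 can work.  It is an
assumption about the general idea, not a transcription of the posted ARM
code: it uses the identity r - p = ~(~r + p), so the subtraction becomes a
multiply-accumulate on the complemented limb.  The names limb_t,
ref_submul_1, compl_submul_1 and check_complement_trick are hypothetical,
and limbs are assumed to be 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb */

/* Reference: rp[0..n-1] -= up[0..n-1] * v, returns the borrow-out limb. */
limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + cy;  /* product plus carry-in */
        limb_t lo = (limb_t)p;
        limb_t r = rp[i];
        cy = (limb_t)(p >> 32) + (r < lo);      /* high limb plus borrow */
        rp[i] = r - lo;
    }
    return cy;
}

/* Same result without any subtraction: complement the limb, accumulate the
   product and carry into it, then complement the low limb again. */
limb_t compl_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* (B-1)^2 + 2(B-1) = B^2 - 1, so this fits in 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + (limb_t)~rp[i] + cy;
        rp[i] = ~(limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* Compare the two variants on one sample operand. */
void check_complement_trick(void)
{
    const limb_t u[3] = {0xffffffffu, 0x12345678u, 0x9abcdef0u};
    limb_t a[3] = {1u, 0u, 0x80000000u};
    limb_t b[3] = {1u, 0u, 0x80000000u};
    limb_t ca = ref_submul_1(a, u, 3, 0xdeadbeefu);
    limb_t cb = compl_submul_1(b, u, 3, 0xdeadbeefu);
    assert(ca == cb);
    for (size_t i = 0; i < 3; i++)
        assert(a[i] == b[i]);
}
```

The inner statement has the shape a*b + c + d, which is what a
multiply-accumulate-accumulate instruction computes, so under this
assumption the loop needs no subtract-with-borrow at all.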

I had on the other hand not realised David's ones complement + pre-invert
carry trick.  I think that trick and this trick will result in the same
loop insn count on most subtraction-challenged machines.

It is possible that this or similar tricks could be useful in other
contexts, such as the 2/1 or 3/2 quotient approximation primitives.

  Running at 3.25 and 3.9 c/l on A9:

Cool!  Looks like it is actually faster than 3.9 for some

Did you time this on some other CPU too?  I have new submul_1 code for
A15 which runs at 2.75 c/l.  It runs at 6.25 c/l on A9...  Fewer
variants are always good, so it'd be nice if your code is faster

To squeeze the last out of this code there are a few things you might
want to try:

1. Use discrete ptr updates for up and/or rp.
2. Move the one-out-of-four autoincrement updates to other ldr/str
3. Use ldm/stm.  Often an A9 win.

