ARM public key benchmark
tg at gmplib.org
Thu Apr 4 12:34:32 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
No speedup for addmul_1, unfortunately, but a saving for submul_1. Here
are new versions of both files (for mpn/arm/v6).
I sometimes get better A9 performance with *discrete* pointer updates
than with one-out-of-four autoincrement pointer updates like those used
here. I think the code you started with had that one-out-of-four trick for str,
I wonder if this
submul_1 complement trick is useful on some other platforms too, e.g.,
Possibly. This is a trick I actually realised many years ago, so it
might very well already be used someplace in GMP.
On the other hand, I had not realised David's ones' complement + pre-invert
carry trick. I think that trick and this trick will result in the same
loop insn count on most subtraction-challenged machines.
It is possible that this or similar tricks could be useful in other
contexts, such as the 2/1 or 3/2 quotient approximation primitives.
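For readers without the attached files, here is a rough C sketch of one way
to read the complement trick. It is not the assembly posted in this thread,
and the names limb_t, ref_submul_1 and compl_submul_1 are made up for the
illustration. It assumes 32-bit limbs and relies on the identity
r - u*v = ~(~r + u*v) (mod B^n), so the loop body becomes a plain
multiply-accumulate of the form a*b + c + d (the form ARMv6 umaal computes)
with a complement before and after.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on 32-bit ARM */

/* Straightforward reference: rp[] -= up[] * v, returns the borrow limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + cy;   /* cannot overflow 64 bits */
        limb_t lo = (limb_t)p;
        limb_t r = rp[i];
        cy = (limb_t)(p >> 32);
        rp[i] = r - lo;
        cy += (r < lo);                          /* borrow from this limb */
    }
    return cy;
}

/* Complement variant: add the product to ~r, then complement the result.
   (B-1) + (B-1)^2 + (B-1) = B^2 - 1, so the sum always fits in 64 bits. */
static limb_t compl_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + (limb_t)~rp[i] + cy;
        rp[i] = ~(limb_t)t;                      /* undo the complement */
        cy = (limb_t)(t >> 32);                  /* carry out acts as borrow */
    }
    return cy;
}
```

Both functions should leave the same limbs in rp and return the same borrow
for any input; the second one simply has no subtract or borrow compare in
its loop.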
Running at 3.25 and 3.9 c/l on A9:
Cool! Looks like it is actually faster than 3.9 for some
Did you time this on some other CPU too? I have new submul_1 code for
A15 which runs at 2.75 c/l. It runs at 6.25 c/l on A9... Fewer
variants are always good, so it'd be nice if your code is faster
To squeeze the last bit of speed out of this code there are a few things you might
want to try:
1. Use discrete ptr updates for up and/or rp.
2. Move the one-out-of-four autoincrement updates to other ldr/str
3. Use ldm/stm. Often an A9 win.