ARM public key benchmark
Niels Möller
nisse at lysator.liu.se
Thu Apr 4 13:51:53 CEST 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> I sometimes get better A9 performance with *discrete* pointer updates,
> not one-out-of-four autoincrement pointer updates like used here. I
> think the code you started with had that one-out-of-four trick for str,
> already?
Right, it uses a single update of rp (which is used for both loads and
stores); I just changed it to handle up updates in a similar way.
> I had on the other hand not realised David's ones complement + pre-invert
> carry trick.
Not sure I understand what you are referring to here. I haven't been
following the sparc developments very closely (and I don't remember much
of sparc assembly).
> Cool! Looks like it is actually faster than 3.9 for some
> alignments/sizes.
It seems one iteration takes 15.5 cycles. I guess that means that even
and odd iterations are executed differently? It would be nice to get it
down to 15 cycles (3.75 c/l) (the addmul_1 iteration takes 13 cycles,
there's no good reason the four additional mvn instructions should cost
more than two cycles). But I find instruction scheduling both very hard
and tedious.
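
To make the arithmetic above explicit, here is a minimal C sketch of the
conversion from cycles per iteration to cycles per limb (c/l). It assumes
four limbs per loop iteration, which is what 15 cycles = 3.75 c/l implies;
the name cycles_per_limb is hypothetical and not part of GMP.

```c
/* Limbs handled per loop iteration. Assumption: 4, as implied by the
   figures in the text (15 cycles per iteration = 3.75 c/l). */
enum { LIMBS_PER_ITERATION = 4 };

/* Convert a per-iteration cycle count into cycles per limb (c/l).
   Hypothetical helper for illustration only. */
static double cycles_per_limb(double cycles_per_iteration)
{
    return cycles_per_iteration / LIMBS_PER_ITERATION;
}
```

With these figures, cycles_per_limb(15.5) gives 3.875 for the measured
loop, cycles_per_limb(15.0) gives the 3.75 target, and
cycles_per_limb(13.0) gives 3.25 for the addmul_1 loop.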
> Did you time this on some other CPU too?
No. When I get home (I don't log in to the gmp machines from the office
network), I might get time to try it on the appropriate gmp machine.
> 1. Use discrete ptr updates for up and/or rp.
Maybe. Costs additional instructions, but with more freedom on where to
place the pointer increment. Loopmixer would help.
A digression: I'm running Debian GNU/Linux on my pandaboard system. The
Linux way to get access to the instruction counter seems to be via
"perf_event_open". However, when I tried it, it seems no hardware-based
events exist (I do get access to the software-based ones though, so the
interface is partially working). Also, clock_gettime with
CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire
"high res timers"-subsystem is non-working. Any clues on where to look
for solving this problem are appreciated. The obvious (to me) things seem
to be enabled in the kernel config:
$ zgrep 'PERF_EV\|HIGH_RES' /proc/config.gz
CONFIG_HIGH_RES_TIMERS=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HW_PERF_EVENTS=y
And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
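
For reference, a minimal C sketch of the two checks described in this
digression: asking the kernel for a hardware cycle counter through
perf_event_open, and querying the resolution it reports for
CLOCK_PROCESS_CPUTIME_ID. The helper names (open_cycle_counter,
count_cycles, cputime_resolution_ns) are hypothetical, and this is not
necessarily the test that was run on the pandaboard. On a system where no
hardware events exist, open_cycle_counter is expected to fail and return -1.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Open a counter for CPU cycles of the calling thread, user space only.
   Returns a file descriptor, or -1 with errno set on failure. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* glibc provides no wrapper, so call through syscall(2):
       pid = 0 (this thread), cpu = -1 (any), group_fd = -1, flags = 0. */
    return (int) syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Count the cycles spent in fn(arg) using a descriptor obtained from
   open_cycle_counter. Returns 0 on success, -1 on failure. */
static int count_cycles(int fd, void (*fn)(void *), void *arg,
                        uint64_t *cycles)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) < 0
        || ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0)
        return -1;
    fn(arg);
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
        return -1;
    return read(fd, cycles, sizeof *cycles) == (ssize_t) sizeof *cycles
        ? 0 : -1;
}

/* Resolution the kernel reports for CLOCK_PROCESS_CPUTIME_ID, in
   nanoseconds, or -1 on failure. */
static long long cputime_resolution_ns(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res) != 0)
        return -1;
    return (long long) res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

A caller would check open_cycle_counter() for -1 and inspect errno before
attempting count_cycles. Note that the reported clock resolution is only
what the kernel claims; it does not by itself guarantee accurate
measurements.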
> 2. Move the one-out-of-four autoincrement updates to other ldr/str
> insns.
Could try that.
> 3. Use ldm/stm. Often an A9 win.
If we want to schedule loads early, that seems to rule out using a
single ldm to load all values used in the loop. Right? But two ldm,
loading two limbs at a time, could work. stm seems easier. Any changes
of this type will break the current loop setup logic, I'm afraid.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.