ARM public key benchmark

Thu Apr 4 13:51:53 CEST 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> I sometimes get better A9 performance with *discrete* pointer updates,
> not one-out-of-four autoincrement pointer updates like used here.  I
> think the code you started with had that one-out-of-four trick for str,
> already?

Right, it uses a single update of rp (which is used for both loads and
stores); I just changed it to handle the up updates in the same way.

> I had on the other hand not realised David's ones complement + pre-invert
> carry trick.

Not sure I understand what you are referring to here. I haven't been
following the sparc developments very closely (and I don't remember much
of sparc assembly).

> Cool!  Looks like it is actually faster than 3.9 for some
> alignments/sizes.

It seems one iteration takes 15.5 cycles. I guess that means that even
and odd iterations are executed differently? It would be nice to get it
down to 15 cycles (3.75 c/l) (the addmul_1 iteration takes 13 cycles,
there's no good reason the four additional mvn instructions should cost
more than two cycles). But I find instruction scheduling both very hard
and tedious.
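
For reference, the cycle counts above convert to cycles per limb as
follows. This is only a sketch of the arithmetic: the factor of four
limbs per iteration is inferred from "15 cycles (3.75 c/l)", and the
function name is made up for illustration.

```python
# Assumption: each loop iteration handles four limbs, which is what
# "15 cycles (3.75 c/l)" implies.
LIMBS_PER_ITERATION = 4


def cycles_per_limb(cycles_per_iteration, limbs=LIMBS_PER_ITERATION):
    """Convert a per-iteration cycle count to cycles per limb (c/l)."""
    return cycles_per_iteration / limbs


for label, cycles in [("measured", 15.5), ("target", 15.0), ("addmul_1", 13.0)]:
    print(f"{label:9s} {cycles:5.1f} cycles/iteration = "
          f"{cycles_per_limb(cycles):.3f} c/l")
```

So the measured 15.5 cycles corresponds to 3.875 c/l, and the two-cycle
budget for the mvn instructions corresponds to 0.5 c/l on top of the
addmul_1 figure.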

> Did you time this on some other CPU too?

No. When I get home (I don't log in to the gmp machines from the office
network), I might get time to try it on the appropriate gmp machine.

> 1. Use discrete ptr updates for up and/or rp.

Maybe. Costs additional instructions, but with more freedom on where to
place the pointer increment. Loopmixer would help.

A digression: I'm running Debian GNU/Linux on my pandaboard system. The
Linux way to get access to the instruction counter seems to be via
"perf_event_open". However, when I tried it, no hardware-based events
seemed to exist (I do get access to the software-based ones, so the
interface is partially working). Also, clock_gettime with
CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire
"high res timers" subsystem is not working. Any clues on where to look
to solve this problem are appreciated. The obvious (to me) things seem
to be enabled in the kernel config:

  $ zgrep 'PERF_EV\|HIGH_RES' /proc/config.gz
  CONFIG_HIGH_RES_TIMERS=y
  CONFIG_HAVE_PERF_EVENTS=y
  CONFIG_PERF_EVENTS=y
  CONFIG_HW_PERF_EVENTS=y

And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.
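
One quick way to quantify the "very poor accuracy" is to compare the
resolution a clock advertises with the smallest step it actually takes.
The following is a minimal sketch, assuming a Unix system with Python 3.7
or later; the function name clock_report and its parameters are made up
for illustration. The interpreter's call overhead puts a floor of
roughly a microsecond on the observed step, so this can only
distinguish a working high-resolution clock from a tick-based one
(steps of milliseconds), not verify cycle accuracy.

```python
import time


def clock_report(clock_id, samples=20, max_reads=1_000_000):
    """Return (advertised resolution, smallest observed step), both in ns.

    The second value is None if the clock never changed within max_reads
    consecutive reads.
    """
    resolution_ns = round(time.clock_getres(clock_id) * 1e9)
    smallest = None
    steps_seen = 0
    prev = time.clock_gettime_ns(clock_id)
    for _ in range(max_reads):
        now = time.clock_gettime_ns(clock_id)
        if now != prev:
            step = now - prev
            if smallest is None or step < smallest:
                smallest = step
            prev = now
            steps_seen += 1
            if steps_seen >= samples:
                break
    return resolution_ns, smallest


for name in ("CLOCK_PROCESS_CPUTIME_ID", "CLOCK_MONOTONIC"):
    res_ns, step_ns = clock_report(getattr(time, name))
    print(f"{name}: advertised {res_ns} ns, smallest observed step {step_ns} ns")
```

If the advertised resolution is 1 ns but the observed steps are in the
millisecond range, the clock is presumably being driven by the periodic
tick rather than by a high-resolution counter.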

> 2. Move the one-out-of-four autoincrement updates to other ldr/str
>    insns.

Could try that.

> 3. Use ldm/stm.  Often an A9 win.

If we want to schedule loads early, that seems to rule out using a
single ldm to load all values used in the loop. Right? But two ldm,
loading two limbs at a time, could work. stm seems easier. Any changes
of this type will break the current loop setup logic, I'm afraid.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.