ARM public key benchmark

Thu Apr 4 14:33:43 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  > I had on the other hand not realised David's ones complement + pre-invert
  > carry trick.

  Not sure I understand what you are referring to here. I haven't been
  following the sparc developments very closely (and I don't remember much
  of sparc assembly).

The newer sparc chips add 64-bit add-with-carry instructions, but they
still don't have corresponding subtraction instructions.  So David sets
carry before entering the loop, and ones-complements the subtrahend.
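
To make the trick concrete, here is a minimal C sketch of the idea, not
David's actual sparc code.  It relies on the identity u - v = u + ~v + 1
(mod 2^64): the "+ 1" is the carry set before the loop, and each later
limb gets the carry out of the previous one.  The function name
sub_n_via_adc is hypothetical, and 64-bit limbs are assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function): rp[] = up[] - vp[] over n
   limbs, using only additions with carry.  Returns the borrow out,
   i.e. 1 if {up,n} < {vp,n}, else 0. */
static uint64_t sub_n_via_adc(uint64_t *rp, const uint64_t *up,
                              const uint64_t *vp, size_t n)
{
    uint64_t cy = 1;                 /* carry pre-set before the loop */
    for (size_t i = 0; i < n; i++) {
        uint64_t v = ~vp[i];         /* ones complement of the subtrahend */
        uint64_t s = up[i] + v;
        uint64_t c1 = s < v;         /* carry out of up[i] + ~vp[i] */
        uint64_t r = s + cy;
        uint64_t c2 = r < s;         /* carry out of adding the incoming carry */
        rp[i] = r;
        cy = c1 | c2;                /* c1 and c2 are never both set */
    }
    return cy ^ 1;                   /* final carry 1 means no borrow */
}
```

The carry is inverted only once, on exit, which is where the cost of not
having a real subtract-with-borrow ends up.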

  > Cool!  Looks like it is actually faster than 3.9 for some
  > alignments/sizes.

  It seems one iteration takes 15.5 cycles. I guess that means that even
  and odd iterations are executed differently?

Probably.  Or that it is limping in some other odd way, such as cache
line related.  But your explanation is the most likely one, I think.

  It would be nice to get it
  down to 15 cycles (3.75 c/l) (the addmul_1 iteration takes 13 cycles,
  there's no good reason the four additional mvn instructions should cost
  more than two cycles). But I find instruction scheduling both very hard
  and tedious.

It is hard and tedious, but it really pays off if one is persistent
enough (or has some tool).

The A9 pipeline can execute two mvn each cycle.

  > 1. Use discrete ptr updates for up and/or rp.

  Maybe. Costs additional instructions, but with more freedom on where to
  place the pointer increment. Loopmixer would help.

I have never seen a case where a separate insn adds execution time.  I
suspect the hardware executes ld with autoupdate as two insns, at least
on some ARMs.  (It should work to execute st with autoupdate within the
usual read-two-regs, write-one-reg budget, unlike ld.)

  A digression: I'm running Debian GNU/Linux on my pandaboard system. The
  Linux way to get access to the instruction counter seems to be via
  "perf_event_open". However, when I tried it, it seems no hardware-based
  events exist (I do get access to the software-based ones though, so the
  interface is partially working). Also, clock_gettime with
  CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire
  "high res timers"-subsystem is non-working. Any clues on where to look
  for solving this problem are appreciated. The obvious (to me) things seem
  to be enabled in the kernel config.

There is an annoying Linux tradition of "implementing" things and then
not making them work for years.  Clocks and timing have always been a sore
area for Linux.  This is why almost all gmplib machines run BSD, where
things actually work.

I have tried to get cycle counters to work on both my ARM systems,
following various examples ("HOWTOs").  Nothing works.  I will not waste
more time on this, but as soon as *BSD is available for Panda or
Chromebook, I will migrate to it.

  And it's no use to even think of porting the loop mixer to arm without
  access to cycle-accurate timing.

That would indeed make it less useful.  I suppose it could still be made
to work by running each sequence enough times for some Linux counter to
be updated.

  > 3. Use ldm/stm.  Often an A9 win.

  If we want to schedule loads early, that seems to rule out using a
  single ldm to load all values used in the loop. Right? But two ldm,
  loading two limbs at a time, could work. stm seems easier. Any changes
  of this type will break the current loop setup logic, I'm afraid.

I assume that ldm loads the registers in some specific order, such as
lowest numbered first.  Then, it could lift the scoreboard bit for
available register values while ldm executes.

Using ldm with just two registers might be pointless.  Also, it will for
50% of alignments take 2 cycles.  Doing three registers is (as we've
discussed in the past) more appealing.

I haven't explored if ldm is a win for A9 compared to well placed
discrete loads.  On A15 ldm seems pretty useless, but it is not harmful
either.

-- 
Torbjörn