ARM public key benchmark

Wed Apr 3 14:05:47 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  >   1. I guess one can expect submul_1 to always be a bit slower than
  >      addmul_1, since submul_1 needs additional arithmetic besides the
  >      umaal? One could perhaps do some negations on the fly,
  >      a - b*C = -((-a) + b*C), maybe that would be advantageous?
  >   
  > I encourage you to work on that; 3.25 c/l vs 5.25 c/l seems like a very
  > large difference between addmul_1 and submul_1.

  After some further thinking, it should work fine with one's complement
  rather than two's complement for the negations,

    a - b*C = ~(b*C + ~a)  (if we do the complements on n+1 limbs)

  So it should be doable with the addmul_1 loop plus two additional "not"
  instructions per limb, neither of them on the recurrency chain, and then
  maybe some extra logic for the return value. One could aim for 4.25 c/l,
  I guess.
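
A minimal C sketch of the one's-complement identity above, to make the
limb-level bookkeeping concrete. It assumes 32-bit limbs; limb_t,
ref_submul_1, compl_submul_1 and check_submul_variants are hypothetical
names for illustration, not GMP's API, and this is portable C rather than
the umaal-based ARM loop under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t; /* assumption: 32-bit limbs */

/* Reference: rp[0..n-1] -= up[0..n-1] * c; returns the borrow-out limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * c + borrow;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 32);
        limb_t r = rp[i];
        rp[i] = r - lo;
        borrow = hi + (r < lo);
    }
    return borrow;
}

/* Same result via a - b*C = ~(b*C + ~a): an addmul-style loop with one
   complement on the way in and one on the way out. */
static limb_t compl_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * c + (limb_t)~rp[i] + carry;
        rp[i] = (limb_t)~(limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Small check: 5 - 3*2 = -1, i.e. all-ones limbs and a borrow of 1. */
static void check_submul_variants(void)
{
    limb_t a1[2] = {5, 0}, a2[2] = {5, 0};
    const limb_t b[2] = {3, 0};
    limb_t r1 = ref_submul_1(a1, b, 2, 2);
    limb_t r2 = compl_submul_1(a2, b, 2, 2);
    assert(r1 == 1 && r2 == 1);
    assert(a1[0] == a2[0] && a1[1] == a2[1] && a2[0] == 0xFFFFFFFFu);
}
```

In this model the carry out of the addmul-style loop already equals the
borrow that submul_1 would return, because the implicit top limb of ~a is
all ones. Whether that carries over unchanged to a real assembly loop is
not something the sketch shows.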

Just send me the code.  :-)

Have you considered complementing C instead?

  > I've never considered addmul_1/submul_1 as alternatives to
  > cnd_add_n/cnd_sub_n.

  But they are, except that addmul_1/submul_1 always work in-place. Should
  be side-channel silent on the same machines where, e.g., mul_1 is
  side-channel silent, right?

Sure, these are often silent.  Where they are not, there will be leakage
problems anyway.
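
A rough C sketch of the point above: an in-place conditional add
expressed as addmul_1 with a multiplier of 0 or 1. It assumes 32-bit limbs
and a condition that is already exactly 0 or 1; sketch_addmul_1,
sketch_cnd_add_inplace and check_cnd_add are hypothetical names, not
GMP's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t; /* assumption: 32-bit limbs */

/* rp[0..n-1] += up[0..n-1] * c; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * c + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* In-place conditional add: rp += ap when cnd is 1, rp unchanged when
   cnd is 0. The same loop body runs in both cases; cnd must be 0 or 1. */
static limb_t sketch_cnd_add_inplace(limb_t cnd, limb_t *rp,
                                     const limb_t *ap, size_t n)
{
    return sketch_addmul_1(rp, ap, n, cnd);
}

static void check_cnd_add(void)
{
    const limb_t a[2] = {1, 2};
    limb_t r[2] = {0xFFFFFFFFu, 1};
    assert(sketch_cnd_add_inplace(0, r, a, 2) == 0);
    assert(r[0] == 0xFFFFFFFFu && r[1] == 1);   /* untouched */
    assert(sketch_cnd_add_inplace(1, r, a, 2) == 0);
    assert(r[0] == 0 && r[1] == 4);             /* carry propagated */
}
```

C source gives no timing guarantee by itself; as noted above, this is
only as silent as the machine's multiply instruction.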

  > A similar situation is that addmul_1/submul_1 is sometimes faster than
  > addlsh_1/sublsh_1.

  And in that case, it would be nice with some configure magic to disable
  the lsh_1 functions and use addmul_1/submul_1 instead.

Indeed.
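
For illustration, a small C sketch of why addmul_1 can stand in for an
lsh_1-style function. It assumes 32-bit limbs and that the operation
wanted is the in-place r += 2*v; the names sketch_addlsh1_inplace,
sketch_addmul_1 and check_lsh1_vs_addmul are hypothetical, and the sketch
says nothing about which variant is faster on a given CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t; /* assumption: 32-bit limbs */

/* rp[0..n-1] += up[0..n-1] * c; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * c + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp += 2*vp by shifting and adding; returns the carry-out (0, 1 or 2). */
static limb_t sketch_addlsh1_inplace(limb_t *rp, const limb_t *vp, size_t n)
{
    limb_t shifted_out = 0, carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s = (limb_t)(vp[i] << 1) | shifted_out;
        shifted_out = vp[i] >> 31;
        uint64_t t = (uint64_t)rp[i] + s + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry + shifted_out;
}

/* Both routes give the same limbs and the same carry-out. */
static void check_lsh1_vs_addmul(void)
{
    const limb_t v[2] = {0xFFFFFFFFu, 0xFFFFFFFFu};
    limb_t r1[2] = {0xFFFFFFFFu, 0xFFFFFFFFu};
    limb_t r2[2] = {0xFFFFFFFFu, 0xFFFFFFFFu};
    limb_t c1 = sketch_addlsh1_inplace(r1, v, 2);
    limb_t c2 = sketch_addmul_1(r2, v, 2, 2);
    assert(c1 == 2 && c2 == 2);
    assert(r1[0] == r2[0] && r1[1] == r2[1] && r1[0] == 0xFFFFFFFDu);
}
```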

-- 
Torbjörn