ARM public key benchmark

Wed Apr 3 09:29:19 CEST 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:

>   But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes,
>   
>   6             #5.4937        5.9282
>
> Not an alarming difference.

Maybe not, but I got a measurable slowdown of some ECC operations when
switching to mpn_cnd_add_n, and my best guess is that this is the reason
for that.

>   1. I guess one can expect submul_1 to always be a bit slower than
>      addmul_1, since submul_1 needs additional arithmetic besides the
>      umaal? One could perhaps do some negations on the fly,
>      a - b*C = -((-a) + b*C); maybe that would be advantageous?
>   
> I encourage you to work on that; 3.25 c/l vs 5.25 c/l seem like a very
> large difference between addmul_1 and submul_1.

After some further thinking, it should work fine with one's complement
rather than two's complement for the negations,

  a - b*C = ~(b*C + ~a)  (if we do the complements on n+1 limbs)

So it should be doable with the addmul_1 loop plus two additional NOT
instructions per limb, neither of them on the recurrency chain, and then
maybe some extra logic for the return value. One could aim for 4.25 c/l,
I guess.
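
A minimal sketch of the one's complement identity in portable C, to make
the (n+1)-limb argument concrete. It assumes hypothetical 32-bit limbs and
plain reference loops (limb_t, ref_addmul_1, ref_submul_1 and
submul_1_via_addmul_1 are illustration names, not GMP's code), and it does
the complements in separate passes rather than fused into the multiply
loop as an assembly version would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical limb type, 32 bits for portability */

/* rp[0..n-1] += up[0..n-1] * c; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (B-1)^2 + 2(B-1) = B^2 - 1 with B = 2^32. */
        uint64_t t = (uint64_t)up[i] * c + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* rp[0..n-1] -= up[0..n-1] * c; returns the borrow-out limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t bw = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * c + bw;
        limb_t lo = (limb_t)p;
        bw = (limb_t)(p >> 32) + (rp[i] < lo);
        rp[i] -= lo;
    }
    return bw;
}

/* a - b*C = ~(b*C + ~a), with the complements taken on n+1 limbs.
 * The implicit top limb of a is 0, so the top limb of ~a is B-1.
 * Adding the addmul_1 carry cy gives cy-1, and complementing that gives
 * -cy mod B, which is the top limb of a - b*C. So in this sketch the
 * carry returned by addmul_1 is already the borrow submul_1 must return. */
static limb_t submul_1_via_addmul_1(limb_t *rp, const limb_t *up,
                                    size_t n, limb_t c)
{
    limb_t cy;
    for (size_t i = 0; i < n; i++)
        rp[i] = ~rp[i];                 /* first NOT per limb */
    cy = ref_addmul_1(rp, up, n, c);
    for (size_t i = 0; i < n; i++)
        rp[i] = ~rp[i];                 /* second NOT per limb */
    return cy;
}
```

Whether a real loop needs extra logic for the return value depends on how
the complements are scheduled around the umaal chain; the sketch only
checks the arithmetic.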

> I've never considered addmul_1/submul_1 as alternatives to
> cnd_add_n/cnd_sub_n.

But they are, except that addmul_1/submul_1 always work in-place. They
should be side-channel silent on the same machines where, e.g., mul_1 is
side-channel silent, right?
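
To illustrate the equivalence, here is a small sketch in portable C of an
in-place conditional add done as addmul_1 with a multiplier of 0 or 1,
next to a mask-based version for comparison. The limb type and all
function names (limb_t, sketch_addmul_1, cnd_add_via_addmul_1,
cnd_add_masked) are assumptions made for the example, not GMP's
implementation, and C source alone says nothing about the timing of the
machine code a compiler produces from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical limb type, 32 bits for portability */

/* rp[0..n-1] += up[0..n-1] * c; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * c + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* Map any nonzero cnd to 1 and zero to 0 without a comparison. */
static limb_t cnd_to_bit(limb_t cnd)
{
    return (limb_t)(cnd | (limb_t)(0u - cnd)) >> 31;
}

/* In-place conditional add: rp += (cnd != 0) ? ap : 0, via addmul_1. */
static limb_t cnd_add_via_addmul_1(limb_t cnd, limb_t *rp,
                                   const limb_t *ap, size_t n)
{
    return sketch_addmul_1(rp, ap, n, cnd_to_bit(cnd));
}

/* The same operation with an all-ones or all-zeros mask, for comparison. */
static limb_t cnd_add_masked(limb_t cnd, limb_t *rp,
                             const limb_t *ap, size_t n)
{
    limb_t mask = (limb_t)(0u - cnd_to_bit(cnd));
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)rp[i] + (ap[i] & mask) + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}
```

Both variants read and write every limb regardless of cnd; the difference
is only whether the selection is done by a multiply or by an AND.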

> A similar situation is that addmul_1/submul_1 is sometimes faster than
> addlsh_1/sublsh_1.

And in that case, it would be nice to have some configure magic to
disable the lsh_1 functions and use addmul_1/submul_1 instead.
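
A rough sketch of what such a substitution amounts to, for the in-place
case rp += 2*vp: addmul_1 with multiplier 2 gives the same limbs and the
same carry-out as an explicit shift-and-add loop. The names here,
including the SKETCH_PREFER_ADDMUL_1 switch, are hypothetical and only
stand in for whatever configure would actually define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical limb type, 32 bits for portability */

/* rp[0..n-1] += up[0..n-1] * c; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t c)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * c + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* rp += 2*vp with an explicit one-bit shift; the carry-out can be 0, 1 or 2. */
static limb_t addlsh1_shift(limb_t *rp, const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = ((uint64_t)vp[i] << 1) + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* The same operation expressed as addmul_1 with multiplier 2. */
static limb_t addlsh1_via_addmul_1(limb_t *rp, const limb_t *vp, size_t n)
{
    return sketch_addmul_1(rp, vp, n, 2u);
}

/* Hypothetical build-time switch; not a real GMP or configure macro. */
#ifdef SKETCH_PREFER_ADDMUL_1
#define sketch_addlsh1 addlsh1_via_addmul_1
#else
#define sketch_addlsh1 addlsh1_shift
#endif
```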

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.