ARM public key benchmark
tg at gmplib.org
Wed Apr 3 14:05:47 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
> 1. I guess one can expect submul_1 to always be a bit slower than
> addmul_1, since submul_1 needs additional arithmetic besides the
> umaal? One could perhaps do some negations on the fly,
> a - b*C = -((-a) + b*C); maybe that would be advantageous?
> I encourage you to work on that; 3.25 c/l vs 5.25 c/l seem like a very
> large difference between addmul_1 and submul_1.
After some further thinking, it should work fine with one's complement
rather than two's complement for the negations,
a - b*C = ~(b*C + ~a) (if we do the complements on n+1 limbs)
So it should be doable with the addmul_1 loop plus two additional not
instructions per limb, both off the recurrency chain, and then maybe
some extra logic for the return value. One could aim for 4.25 c/l, I guess.
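For reference, the one's complement identity can be checked in portable C. This is only a sketch under assumptions: 32-bit limbs, a hypothetical limb_t type, and plain C loops standing in for the umaal-based assembly. The names ref_addmul_1, ref_submul_1, submul_1_via_complement and agrees_with_reference are illustrative and are not GMP functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, B = 2^32 */

/* Reference: {rp,n} += {up,n} * v, returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* u*v + r + cy <= (B-1)^2 + 2(B-1) = B^2 - 1, so it fits in 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* Reference: {rp,n} -= {up,n} * v, returns the borrow-out limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t bw = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + bw;
        limb_t lo = (limb_t)p;
        bw = (limb_t)(p >> 32) + (rp[i] < lo);
        rp[i] -= lo;
    }
    return bw;
}

/* a - b*C = ~(b*C + ~a), with the complements taken on n+1 limbs.
   The (n+1)th limb of a is an implicit zero, kept in the local hi. */
static limb_t submul_1_via_complement(limb_t *rp, const limb_t *up,
                                      size_t n, limb_t v)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = (limb_t)~rp[i];            /* first not: ~a, low n limbs */
    limb_t hi = (limb_t)~(limb_t)0;        /* top limb of ~a is all ones */

    hi += ref_addmul_1(rp, up, n, v);      /* ~a + b*C, wraps mod B^(n+1) */

    for (size_t i = 0; i < n; i++)
        rp[i] = (limb_t)~rp[i];            /* second not: low n limbs */
    hi = (limb_t)~hi;                      /* top limb of a - b*C mod B^(n+1) */

    return (limb_t)(0 - hi);               /* borrow is the negated top limb */
}

/* Returns 1 if both routines give the same limbs and the same return value.
   Limited to n <= 8 to keep the scratch buffers on the stack. */
static int agrees_with_reference(const limb_t *a, const limb_t *b,
                                 size_t n, limb_t v)
{
    limb_t r1[8], r2[8];
    if (n > 8)
        return 0;
    memcpy(r1, a, n * sizeof(limb_t));
    memcpy(r2, a, n * sizeof(limb_t));
    limb_t bw1 = ref_submul_1(r1, b, n, v);
    limb_t bw2 = submul_1_via_complement(r2, b, n, v);
    return bw1 == bw2 && memcmp(r1, r2, n * sizeof(limb_t)) == 0;
}
```

In this sketch the negated top limb works out to exactly the carry returned by the addmul_1 step, so the extra logic for the return value may turn out to be small. The two complement passes are separate loops only for readability; an assembly loop would presumably fold both not instructions into the addmul_1 loop itself.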
Just send me the code. :-)
Have you considered complementing C instead?
> I've never considered addmul_1/submul_1 as alternatives to
But they are, except that addmul_1/submul_1 always work in-place. Should
be side-channel silent on the same machines where, e.g., mul_1 is
side-channel silent, right?
Sure, these are often silent. Where they are not, there will be leakage
> A similar situation is that addmul_1/submul_1 is sometimes faster than
And in that case, it would be nice with some configure magic to disable
the lsh_1 functions and use addmul_1/submul_1 instead.