ARM public key benchmark
Torbjorn Granlund
tg at gmplib.org
Tue Apr 2 20:12:03 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
> I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
> like to try.
That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
(and I always do in-place operation, so that works). Now, cnd_sub_n
beats submul_1 (except for n == 2, which I don't use):
$ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_submul_1.1 mpn_cnd_sub_n
clock_gettime is 1.000ns accurate
overhead 8.87 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
    mpn_submul_1.1  mpn_cnd_sub_n
1         #19.8927        21.6831
2         #10.9752        12.4106
3           9.5514        #8.9371
4           8.5227        #6.6696
5           7.8316        #6.7412
6           7.1571        #6.0339
7           7.2859        #5.3320
8           6.8553        #4.8715
9           6.6945        #5.0376
10          6.3129        #4.8351
100         5.5065        #3.2110
This is perhaps not thanks to the speed of mpn_cnd_sub_n, but due to
mpn_submul_1's slowness. I have a new A15 submul_1; I know of no A9
improvement.
But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes:
$ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_addmul_1.1 mpn_cnd_add_n
clock_gettime is 1.000ns accurate
overhead 8.94 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
    mpn_addmul_1.1  mpn_cnd_add_n
1         #19.8927        21.2256
2         #10.8574        11.6940
3          #8.0235         8.5240
4          #6.4561         6.5216
5          #6.0308         6.5071
6          #5.4937         5.9282
7          #5.2063         5.3603
8           4.8838        #4.7493
9          #4.9249         4.9533
10         #4.5364         4.8244
100         3.4846        #3.2842
Not an alarming difference.
Some questions:
1. I guess one can expect submul_1 to always be a bit slower than
addmul_1, since submul_1 needs additional arithmetic besides the
umaal? One could perhaps do some negations on the fly, a - b*C =
-((-a) + b*C); maybe that would be advantageous?
I encourage you to work on that; 3.25 c/l vs 5.25 c/l seems like a very
large difference between addmul_1 and submul_1.
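The negation identity in question 1 can be checked on small numbers. What
follows is a minimal Python sketch, not GMP code: it models an n-limb operand
as an integer modulo B^n with an assumed 32-bit limb, and all function names
are made up for illustration. It only shows that both sides agree in the low
n limbs. It says nothing about the carry/borrow-out limb, nor about whether
negating on the fly would actually be cheaper on ARM.

```python
LIMB_BITS = 32  # assumed limb size, for illustration only


def modulus(n):
    """B^n for an n-limb operand."""
    return 1 << (LIMB_BITS * n)


def submul_1_model(a, b, c, n):
    """Low n limbs of a - b*c (hypothetical model of a submul_1 result)."""
    return (a - b * c) % modulus(n)


def addmul_1_model(a, b, c, n):
    """Low n limbs of a + b*c (hypothetical model of an addmul_1 result)."""
    return (a + b * c) % modulus(n)


def neg_model(a, n):
    """Two's complement negation of an n-limb value."""
    return (-a) % modulus(n)


def submul_via_addmul(a, b, c, n):
    """a - b*c computed as -((-a) + b*c), all modulo B^n."""
    return neg_model(addmul_1_model(neg_model(a, n), b, c, n), n)
```

For example, submul_via_addmul(5, 3, 7, 2) and submul_1_model(5, 3, 7, 2)
both give 2^64 - 16.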
2. cnd_add_n should be at least as fast as addmul_1, shouldn't it? It
appears to be 0.25 c/l faster for larger operands, so maybe it's "only"
a question of optimizing loop setup and feed-in?
I suppose I've given addmul_1 much more attention. And the focus on any
cnd_ functions is side channel silence, not ultimate speed.
I've never considered addmul_1/submul_1 as alternatives to
cnd_add_n/cnd_sub_n. We might very well have cases where the former is
faster, as per http://gmplib.org/devel/asm.html. A similar situation is
that addmul_1/submul_1 is sometimes faster than addlsh_1/sublsh_1.
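Why addmul_1 can stand in for an in-place cnd_add_n is easy to see in a small
model. The following Python sketch is not GMP's implementation: the 32-bit
limb, the list-of-limbs representation, and the function names are
assumptions for illustration, and it reads the fallback as "multiply by a
limb that is 0 or 1". Python itself gives no side-channel guarantees; the
mask is only there to mirror how a conditional add avoids branching on cnd.

```python
LIMB_BITS = 32                 # assumed limb size, for illustration only
MASK = (1 << LIMB_BITS) - 1


def addmul_1_model(rp, up, v):
    """In place: rp += up * v, limb by limb. Returns the carry-out limb."""
    carry = 0
    for i in range(len(up)):
        t = rp[i] + up[i] * v + carry
        rp[i] = t & MASK
        carry = t >> LIMB_BITS
    return carry


def cnd_add_n_model(cnd, rp, up):
    """In place: rp += up if cnd is nonzero, else rp is left unchanged.

    The operand is masked rather than skipped, so the same sequence of
    limb operations runs for either value of cnd.
    """
    mask = -int(cnd != 0) & MASK   # all ones if cnd != 0, else zero
    carry = 0
    for i in range(len(up)):
        t = rp[i] + (up[i] & mask) + carry
        rp[i] = t & MASK
        carry = t >> LIMB_BITS
    return carry
```

With r = [MASK, 1] and u = [1, 2], both cnd_add_n_model(1, r, u) and
addmul_1_model(r, u, 1) leave r == [0, 4] and return carry 0. With cnd or v
equal to 0, r is unchanged in both cases.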
--
Torbjörn