ARM public key benchmark

Tue Apr 2 15:14:18 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

> I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
> like to try.

That wasn't a clear win. I use addmul_1 and submul_1 as a fallback
(and I always operate in place, so that works). Now, cnd_sub_n
beats submul_1 (except for n <= 2; I don't use n == 2):

$ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_submul_1.1 mpn_cnd_sub_n
clock_gettime is 1.000ns accurate
overhead 8.87 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
        mpn_submul_1.1 mpn_cnd_sub_n
1            #19.8927       21.6831
2            #10.9752       12.4106
3              9.5514       #8.9371
4              8.5227       #6.6696
5              7.8316       #6.7412
6              7.1571       #6.0339
7              7.2859       #5.3320
8              6.8553       #4.8715
9              6.6945       #5.0376
10             6.3129       #4.8351
100            5.5065       #3.2110

But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes:

$ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_addmul_1.1 mpn_cnd_add_n
clock_gettime is 1.000ns accurate
overhead 8.94 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
        mpn_addmul_1.1 mpn_cnd_add_n
1            #19.8927       21.2256
2            #10.8574       11.6940
3             #8.0235        8.5240
4             #6.4561        6.5216
5             #6.0308        6.5071
6             #5.4937        5.9282
7             #5.2063        5.3603
8              4.8838       #4.7493
9             #4.9249        4.9533
10            #4.5364        4.8244
100            3.4846       #3.2842
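For readers who want the two approaches side by side, here is a minimal
sketch of an in-place conditional subtraction done both ways. It assumes
the fallback calls submul_1 with a multiplier of 0 or 1, which is my
reading of the ".1" in the speed runs. The names limb_t, ref_submul_1 and
ref_cnd_sub_n are hypothetical portable stand-ins with 32-bit limbs, not
GMP's code; they only mirror the semantics of mpn_submul_1 and
mpn_cnd_sub_n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb, as on 32-bit ARM */

/* Stand-in for mpn_submul_1: rp[] -= up[] * v, returns the borrow-out limb.
   With v restricted to 0 or 1 this acts as a conditional subtraction. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t borrow = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* up[i] * v + borrow always fits in 64 bits */
        uint64_t prod = (uint64_t)up[i] * v + borrow;
        limb_t lo = (limb_t)prod;
        limb_t hi = (limb_t)(prod >> 32);
        limb_t r = rp[i];
        rp[i] = r - lo;
        borrow = hi + (r < lo);
    }
    return borrow;
}

/* Stand-in for an in-place mpn_cnd_sub_n: rp[] -= (cnd ? up[] : 0),
   returns the borrow-out. The operand is masked instead of multiplied. */
static limb_t ref_cnd_sub_n(limb_t cnd, limb_t *rp, const limb_t *up, size_t n)
{
    limb_t mask = (limb_t)-(limb_t)(cnd != 0);   /* all ones or all zeros */
    limb_t borrow = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i] & mask;
        limb_t r = rp[i];
        limb_t d = r - u;
        limb_t b1 = (r < u);        /* borrow from r - u */
        limb_t b2 = (d < borrow);   /* borrow from subtracting the old borrow */
        rp[i] = d - borrow;
        borrow = b1 | b2;
    }
    return borrow;
}
```

Both functions do the same amount of work whatever the condition is.
The masked version needs no multiply, which is one plausible reason a
tuned cnd_sub_n can win for larger n.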

Some questions:

1. I guess one can expect submul_1 to always be a bit slower than
   addmul_1, since submul_1 needs additional arithmetic besides the
   umaal? One could perhaps do some negations on the fly,
   a - b*C = -((-a) + b*C); maybe that would be advantageous?

2. cnd_add_n should be at least as fast as addmul_1, shouldn't it? It
   appears to be 0.25 c/l faster for larger operands, so maybe it's "only"
   a question of optimizing loop setup and feed-in?
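
To make the identity in question 1 concrete, here is a small sketch that
computes the low n limbs of a - b*C as -((-a) + b*C), working modulo
B^n with B = 2^32. The helpers ref_neg_n, ref_addmul_1 and
submul_1_via_neg are hypothetical stand-ins written for this
illustration. They are not GMP functions, and the sketch ignores the
borrow-out limb that a real submul_1 would return. A tuned loop would
presumably fold the negations into the multiply-accumulate rather than
make separate passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb */

/* rp[] = -rp[] mod B^n: complement every limb, then add 1 with carry. */
static void ref_neg_n(limb_t *rp, size_t n)
{
    limb_t carry = 1;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        limb_t t = (limb_t)(~rp[i]) + carry;
        carry = (t < carry);   /* carry only propagates when t wraps to 0 */
        rp[i] = t;
    }
}

/* Stand-in for mpn_addmul_1: rp[] += up[] * v, returns the carry-out limb.
   Each step has the a*b + c + d shape that ARM's umaal computes. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* (B-1)^2 + 2(B-1) = B^2 - 1, so this cannot overflow 64 bits */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Low n limbs of rp[] - up[] * v, using only negation and addmul. */
static void submul_1_via_neg(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    ref_neg_n(rp, n);                   /* -a           */
    (void)ref_addmul_1(rp, up, n, v);   /* -a + b*C     */
    ref_neg_n(rp, n);                   /* -(-a + b*C)  */
}
```

For example, with a = {5, 0, 7}, b = {6, 1, 2} (least significant limb
first) and C = 3, the result is {0xFFFFFFF3, 0xFFFFFFFC, 0}, the same
low limbs a direct limb-by-limb subtraction of 3*b from a gives.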

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.