ARM public key benchmark

Tue Apr 2 20:12:03 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

  nisse at lysator.liu.se (Niels Möller) writes:

  > I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
  > like to try.

  That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
  (and I always do in-place operation, so that works). Now, cnd_sub_n
  beats submul_1 (except for n == 2, which I don't use):

  $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_submul_1.1 mpn_cnd_sub_n
  clock_gettime is 1.000ns accurate
  overhead 8.87 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
          mpn_submul_1.1 mpn_cnd_sub_n
  1            #19.8927       21.6831
  2            #10.9752       12.4106
  3              9.5514       #8.9371
  4              8.5227       #6.6696
  5              7.8316       #6.7412
  6              7.1571       #6.0339
  7              7.2859       #5.3320
  8              6.8553       #4.8715
  9              6.6945       #5.0376
  10             6.3129       #4.8351
  100            5.5065       #3.2110

This is perhaps not thanks to the speed of mpn_cnd_sub_n, but due to
mpn_submul_1's slowness.  I have a new A15 submul_1, but I know of no A9
improvement.

  But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes:

  $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_addmul_1.1 mpn_cnd_add_n
  clock_gettime is 1.000ns accurate
  overhead 8.94 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
          mpn_addmul_1.1 mpn_cnd_add_n
  1            #19.8927       21.2256
  2            #10.8574       11.6940
  3             #8.0235        8.5240
  4             #6.4561        6.5216
  5             #6.0308        6.5071
  6             #5.4937        5.9282
  7             #5.2063        5.3603
  8              4.8838       #4.7493
  9             #4.9249        4.9533
  10            #4.5364        4.8244
  100            3.4846       #3.2842

Not an alarming difference.
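
For readers following along: the fallback works because an in-place
addmul_1 with a multiplier of 0 or 1 yields the same limbs and carry as a
conditional add.  Below is a minimal portable C sketch of the two
formulations.  The names limb_t, ref_addmul_1 and ref_cnd_add_n, and the
32-bit limb size, are assumptions made for illustration; this is not GMP's
ARM assembly, and GMP's real mpn_cnd_add_n takes two separate source
operands and accepts any non-zero cnd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb, as on 32-bit ARM */

/* rp[] += bp[] * c, in place; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *bp, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* product + two limb-sized addends always fits in 64 bits */
        uint64_t t = (uint64_t)bp[i] * c + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[] += (cnd ? bp[] : 0), in place, with no branch on cnd.
   This sketch requires cnd to be 0 or 1.  Returns the carry-out. */
limb_t ref_cnd_add_n(limb_t cnd, limb_t *rp, const limb_t *bp, size_t n)
{
    assert(cnd <= 1);
    limb_t mask = (limb_t)0 - cnd;   /* 0 -> 0x00000000, 1 -> 0xffffffff */
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)rp[i] + (bp[i] & mask) + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}
```

Both loops touch the same memory in the same order whatever the condition
is, so either form can serve as the conditional add; which one is faster
depends on the particular assembly loops, as the tables show.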

  Some questions:

  1. I guess one can expect submul_1 to always be a bit slower than
     addmul_1, since submul_1 needs additional arithmetic besides the
     umaal? One could perhaps do some negations on the fly,
     a - b*C = -((-a) + b*C); maybe that would be advantageous?

I encourage you to work on that; 3.25 c/l vs 5.25 c/l seems like a very
large difference between addmul_1 and submul_1.
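
One way the negation idea could look is sketched below in portable C.
It uses the one's complement rather than the two's-complement negation of
the quoted identity (a swap made here for illustration): with B^n the
limb-vector modulus, ~a = B^n - 1 - a, so ~(~a + b*c) = a - b*c (mod B^n),
and the carry out of the addition equals the borrow out of the
subtraction.  The names limb_t, ref_submul_1 and neg_submul_1 and the
32-bit limb size are assumptions; whether this shape actually helps an
A9 or A15 loop is not something this sketch can show.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb */

/* Direct form: rp[] -= bp[] * c, in place; returns the borrow-out limb. */
limb_t ref_submul_1(limb_t *rp, const limb_t *bp, size_t n, limb_t c)
{
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)bp[i] * c + borrow;
        limb_t lo = (limb_t)p;
        /* high half of the product plus the borrow from this limb */
        borrow = (limb_t)(p >> 32) + (rp[i] < lo);
        rp[i] -= lo;
    }
    return borrow;
}

/* Negated form: complement the input limb, multiply-accumulate,
   complement the output limb.  The inner expression is a product plus
   two limb-sized addends, the shape an instruction like umaal computes. */
limb_t neg_submul_1(limb_t *rp, const limb_t *bp, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)bp[i] * c + (limb_t)~rp[i] + carry;
        rp[i] = ~(limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;   /* equals the borrow of the direct form */
}
```

The complement costs two extra logical operations per limb in this naive
form, so any gain would have to come from scheduling them alongside the
multiply-accumulate.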

  2. cnd_add_n should be at least as fast as addmul_1, shouldn't it? It
     appears to be 0.25 c/l faster for larger operands, so maybe it's "only"
     a question of optimizing loop setup and feed-in?

I suppose I've given addmul_1 much more attention.  And the focus of the
cnd_ functions is side-channel silence, not ultimate speed.

I've never considered addmul_1/submul_1 as alternatives to
cnd_add_n/cnd_sub_n.  We might very well have cases where the former is
faster, as per http://gmplib.org/devel/asm.html.  A similar situation is
that addmul_1/submul_1 is sometimes faster than addlsh_1/sublsh_1.
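
The addlsh_1 comparison rests on the same kind of equivalence: adding an
operand shifted left by one bit is the same as an addmul_1 with a
multiplier of 2.  Below is a small C sketch of the in-place case.  The
names limb_t, ref_addlsh1 and ref_addmul_1 and the 32-bit limb size are
assumptions, not GMP's internal prototypes.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb */

/* rp[] += bp[] << 1 using only shifts and adds; returns carry-out (0..2). */
limb_t ref_addlsh1(limb_t *rp, const limb_t *bp, size_t n)
{
    limb_t carry = 0;        /* carry from the addition */
    limb_t shifted_out = 0;  /* top bit shifted out of the previous limb */
    for (size_t i = 0; i < n; i++) {
        limb_t s = (limb_t)(bp[i] << 1) | shifted_out;
        shifted_out = bp[i] >> 31;
        uint64_t t = (uint64_t)rp[i] + s + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry + shifted_out;
}

/* rp[] += bp[] * c; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *bp, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)bp[i] * c + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}
```

Calling ref_addmul_1 with c = 2 gives the same limbs and carry as
ref_addlsh1, so the choice between them is purely a question of which
loop runs faster on a given core.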

-- 
Torbjörn