ARM public key benchmark

Torbjorn Granlund tg at
Wed Apr 3 14:11:50 CEST 2013

nisse at (Niels Möller) writes:

  For large operands, it's strictly between add_n and addmul_1, which I
  guess is as expected. For small sizes, I had a look at the loop setup
  for add_n, which checks bit 0 and 1 of n separately. If that's faster,
  maybe one could borrow that logic.
Let me know if you get any improvement from that trick.  (And please
watch for slowdown on A15!)
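
For readers following along, the loop setup being discussed can be
pictured roughly as below. This is only a C sketch under assumptions, not
the actual assembly: the names add_limb and add_n_sketch are made up for
illustration, limbs are assumed to be 32 bits, and the main loop is
assumed to be unrolled four ways. Bit 0 of n peels off one limb, bit 1
peels off two, and what remains is a multiple of four.

```c
#include <stddef.h>
#include <stdint.h>

/* One limb step: *r = a + b + cy, returns the carry out (0 or 1).
   Hypothetical helper, standing in for a single adcs. */
static uint32_t add_limb(uint32_t *r, uint32_t a, uint32_t b, uint32_t cy)
{
    uint64_t s = (uint64_t)a + b + cy;
    *r = (uint32_t)s;
    return (uint32_t)(s >> 32);
}

/* Sketch of an add_n whose setup tests bit 0 and bit 1 of n separately,
   so that the main loop can always handle four limbs per iteration. */
static uint32_t add_n_sketch(uint32_t *rp, const uint32_t *ap,
                             const uint32_t *bp, size_t n)
{
    uint32_t cy = 0;
    size_t i = 0;

    if (n & 1) {                      /* bit 0: one odd limb */
        cy = add_limb(&rp[i], ap[i], bp[i], cy);
        i += 1;
    }
    if (n & 2) {                      /* bit 1: one pair of limbs */
        cy = add_limb(&rp[i], ap[i], bp[i], cy);
        cy = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], cy);
        i += 2;
    }
    for (; i < n; i += 4) {           /* remaining count is a multiple of 4 */
        cy = add_limb(&rp[i], ap[i], bp[i], cy);
        cy = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], cy);
        cy = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], cy);
        cy = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], cy);
    }
    return cy;
}
```

The C version says nothing about speed; whether this setup beats a
single combined test is exactly what would have to be measured.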

  I also wonder if there are any other tricks to speed up cnd_add_n. As
  far as I understand, shift operations on arm don't truncate shift counts
  to 5 bits (0-31), so one could perhaps replace

     bic	b, b, cnd		C zero for true, all ones for false
     adcs	r, a, b

  with

     adcs	r, a, b, lsl cnd	C zero for true, 32 for false

  (If we believe that timing and internal dependencies are independent of
  the shift count). I played a little with that, but I get no speed
  improvement so far.
That is a clever trick, but I would not be sure ARM chips execute them
in the same cnd-agnostic way.

I suspect all current chips execute these combination instructions as if
they were two instructions.
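
To make the two variants concrete, here is a small C model of what each
one computes. It is a sketch under assumptions, not GMP code: the names
arm_lsl, cnd_add_n_mask and cnd_add_n_shift are hypothetical, and limbs
are assumed to be 32 bits. arm_lsl models a register-specified LSL in
ARM state, where the count is taken from the low byte of the register
and counts of 32 or more give zero. The model only shows that the two
forms produce the same result; it cannot say anything about timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "lsl" by a register count: low byte of count, 32..255 -> 0. */
static uint32_t arm_lsl(uint32_t x, uint32_t count)
{
    count &= 0xff;
    return count < 32 ? x << count : 0;
}

/* Mask variant: cnd is zero for true (add b), all ones for false. */
static uint32_t cnd_add_n_mask(uint32_t cnd, uint32_t *rp,
                               const uint32_t *ap, const uint32_t *bp,
                               size_t n)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t b = bp[i] & ~cnd;               /* bic  b, b, cnd */
        uint64_t s = (uint64_t)ap[i] + b + cy;   /* adcs r, a, b   */
        rp[i] = (uint32_t)s;
        cy = (uint32_t)(s >> 32);
    }
    return cy;
}

/* Shift variant: cnd is zero for true (add b), 32 for false. */
static uint32_t cnd_add_n_shift(uint32_t cnd, uint32_t *rp,
                                const uint32_t *ap, const uint32_t *bp,
                                size_t n)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* adcs r, a, b, lsl cnd */
        uint64_t s = (uint64_t)ap[i] + arm_lsl(bp[i], cnd) + cy;
        rp[i] = (uint32_t)s;
        cy = (uint32_t)(s >> 32);
    }
    return cy;
}
```

A C compiler is free to turn the comparison in arm_lsl into a branch, so
this is useful only for checking results, not as a constant-time
implementation.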

