Performance of modular exponentiation

Tue Dec 8 13:33:00 UTC 2020

Ivo maffei <ivomaffei at gmail.com> writes:

  I understand the high-level algorithms used i.e. sliding window, Toom
  multiplication and Montgomery reduction.

A good start, then!

  If I understand the source code correctly, the sliding window “core”
  is in the macro INNERLOOP in mpn_powm.

Indeed.

  Mpz_powm contains a lot more code that confuses me a little.

  I would like to know if there are _practical_ optimisations that are
  meant to substantially improve performance.
  The chapter 15.8 "Assembly Code” list quite a few such techniques. Is
  there something done at the C level?

The idea with that INNERLOOP mess is to move lots of foldable
expressions from the innerloop.  That helps reduce overhead, great for
small operands.  (I think we should combine some of the INNERLOOP
invocations for larger operands, as overhead there is minimal.)

Much of GMP's performance comes from its assembly code.  It is not just
any assembly, written with some assumtion that "assembly ia fast".  Many
functions comes tailored for many different pipelines, and the code was
semi-automatically re-scheduled with a randomised loop optimiser tool.

-- 
Torbjörn
Please encrypt, key id 0xC8601622