Small-base powm

Tue Mar 25 10:14:35 UTC 2014

I've been thinking a bit about how to handle powm with a single limb
base. In particular, the base 2 seems to be a popular choice for
primality testing.

Now, all algorithms do bitsize(e) modular squarings (ok, not all; with a
fix base precomputation as in Pippinger's algorithm
http://cr.yp.to/papers.html#pippenger, one can get a large speedup by
reducing the number of squarings. Not sure whether or not that's a good
idea for the case of a small base which isn't invariant).

So optimizatinos apply mainly to the modular multiplications. Comments
in mpn/generic/powm.c say

   * When U (the base) is small, we should start the exponentiation with plain
     operations, then convert that partial result to REDC form.

Consider first the simplest algorithm, left-to-right binary, doing one
exponent bit at a time. Like,

  R <-- 1

  loop over bits of E, most significant first 

    R <-- R^2 (mod M)
    if (current E bit set)
      R <-- u R (mod M)

   R <-- R mod M    // canonicalize result,

Let n be the limb count of M. Since R starts small, initially keep track
if its size, and as long as its size <= n, don't do any reductions.

When the size of R exceeds n, it should be reduced modulo M. But there's
no need to do a canonical reduction, we can allow any representative
which fits in n limbs.

For the squarings, we should use the same montgomery reduction as in the
current code (except that maybe one could do something more clever for
the first squaring with a result exceeding n in size, where the size of
the unreduced square is in the interval n < rn < 2n).

For the modular multiplies, there's no reason to use montgomery
multiplication, since that makes the small base u explode to an n-limb
number. We get a product of n+1 limbs, and we need to reduce it to n
limbs. I think this could be done with a addmul_1 or submul_1, and a
single and unlikely adjustment step (or maybe it can be arrange with no
adjustment step at all)?

Each modular multiply clearly is O(n). There may be no reason to do
anything more sophisticated then the left-to-right binary, since
squarings will likely totally dominate the running time. But there are
some possibilities. 

* One could find the largest k such that u^{2^k-1} fits in a single
  limb, and process k exponent bits at a time.

* One could handle powers of 2 specially, with shifts instead of
  multiplies.

* Maybe one could extend this to take advantage of addmul_2, if present.

* In general, for "small" U of size somewhere between a single limb, and
  full n limbs, there's got to be some threshold between using euclidean
  reduction of less than n limbs, to montgomery reduction of n limbs.

* Maybe one shouldn't do the n+1 to n reduction as a separate step, but
  somehow combine it with a montgomery reduction. E.g., one could
  probably write a pretty efficient function which computes u R B^{-1}
  (mod M) as an n-limb number. We then get some additional power of
  B^{-1} (mod M) into the result, but maybe they can be handled
  efficiently.

And if we do additional powm functions, with single limb modulo (and
possibly large exponent, since reducing it requires the factorization of
m), or single-limb exponent, or single-limb base, we get another
challenge in naming all the functions. ;-)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.