Fast constant-time gcd computation and modular inversion

Wed Aug 31 10:26:35 CEST 2022

  >     count = (49 * bits + 57) / 17;
  >
  > Odd.

  For sure. This isn't based on local progress of the algorithm (there
  ain't no guaranteed progress for a short sequence of reduction steps),
  but based on rather complex analysis of the number of steps needed for
  the complete 2-adic quotient sequence.

That count is a proven "upper bound" of the numbver of iterations.  Does
the paper discuss how tight it is, i.e., is it close to a "lower bound"
of the required number of iterations.

  > I think your measurements are a bit optimistic, though.  My measruments
  > from slightly modified timing code suggest 4.5x slowdown, and more for
  > really small operands.

  Maybe, I tried to keep the timing really simple and portable.

I simply up'ed the number of iterations (actually made them depend in
the operand size).

  I've now implemented inverse too. See updated code below. When I run it,
  I get

  invert size 1, ref 0.293 (us), djb 1.145 (us)
  invert size 2, ref 0.721 (us), djb 1.659 (us)
  invert size 3, ref 0.994 (us), djb 2.375 (us)
  [...]
  invert size 17, ref 5.157 (us), djb 18.538 (us)
  invert size 18, ref 5.207 (us), djb 19.064 (us)
  invert size 19, ref 5.702 (us), djb 21.286 (us)

  so a slowdown of 3--4 compared to mpz_invert. This timing excludes the
  precomputation of a few needed per-modulo constants.

Very very nice!

People not famililiar with what we're trying to do here might be
unimpressed with these results.  "Look, I've written new code which is 4
times slower than what we have, HURRAY!"

Comparing to the existing side-channel-silent inversion code would 
impress more, I think.

  I think the inverse flavor is still rather neat, main loop is

    for (delta = 1;count > STEPS;count -= STEPS)
      {
        delta = steps (&M, STEPS, delta, fp[0], gp[0]);
        matrix_vector_mul (&M, STEPS, n+1, fp, gp, tp);
        matrix_vector_mul_mod (&M, Bp, n+2, up, vp, tp);
      }

I can make it even neater:

    for (delta = 1;count > STEPS;count -= STEPS)
      {
        do_it_all (fp, gp, up, vp, n);
      }

:-)

  I'm thinking I'll try to implement and benchmark this for Nettle's ecc
  functions first, before attempting to update GMP function. The reason is
  that (i) I really want to do needed precomputations for all moduli of
  interest for Nettle at compile time, which would complicate the GMP
  interface if it is supported at all, and (ii) in some cases I want
  inversion to operate on numbers in montgomery representation, and then
  I'd like to fold any needed additional power of two into the precomputed
  constant.

What percentage of Ec scalar mul is used for the final inversion?  Not
much, I suppose.  Does Nettle use exponentiation there today, or
mpn_sec_invert?

-- 
Torbjörn
Please encrypt, key id 0xC8601622