Looking for advice on writing a high-speed, 64-bit, 2^n modulo q

Fri Mar 11 03:12:00 CET 2005

This is not another "GMP gives poor results for small numbers" but is
in a slightly similar spirit.  I'm preparing to do a very large
computation where I expect most of the time to be spent in computing
2^n modulo q for various odd q, where both "n" and "q" are unsigned
64-bit integers.  By "large computation" I mean that I hope that it
will take less than a year if I have several tens of modern
Pentium-or-Athlon processors working on it.  (Those are the kinds of
machines that are readily available to me.)

I've already been using an "mpz" implementation to do these
computations.  Recently I wrote an "mpn" version that seems to be
about 40% faster.  (See the attached file, pow2_mod_64bit-mpn-x86.c,
at the end of this e-mail.  I usually compile with "gcc -O3" but no
other optimizations.)  As an aside, I'd argue that although 40% is a
considerable speedup, the mpz version is not bad, especially
considering the small number of limbs involved.

In any case, I'm now considering rewriting the code in assembly
language, since I suspect that there's still a fair amount of overhead
in calling the mpn code (and in the mpn code having to count the
limbs, etc.)

Unfortunately, I'm far from an experienced assembly language
programmer these days.  (It was different 20 years ago!)  I can
certainly compile my "mpn" version to produce a ".s" file, and then
touch-up that ".s" file by hand to replace calls to the mpn functions
with inline assembly code, but I'm tempted to try to do better.

All of which leads to my question, which is a request for advice,
comments, and perhaps an offer to correspond with me outside this
mailing list to cover the many detailed questions that I have about
assembly language programming for Pentium-or-Athlon computers.

-- Regards, Will Galway
( mailto:galway at math.uiuc.edu  http://www.math.uiuc.edu/~galway )

-------------- next part --------------
/*
 * Time-stamp: <pow2_mod_64bit-mpn-x86.c: 10-Mar-2005 18:37:21 CST galway at math.uiuc.edu>
 *
 * Compute 2^n modulo q where q is a 64-bit unsigned integer, using
 * gmp's "mpn" functions and Peter Montgomery's "REDC" algorithm for
 * modular multiplication.
 *
 * This code is NOT machine independent: in particular, it assumes
 * that limbs (see gmp documentation) are 32-bits, and that the
 * machine is strictly "little endian" (so it's suitable for Intel x86
 * machines).
 *
 * Here's a capsule summary of Montgomery's algorithm which roughly
 * follows the presentation given in the book PRIME NUMBERS: A
 * COMPUTATIONAL PERSPECTIVE, by Richard Crandall and Carl Pomerance
 * (Crandall&Pomerance) -- see Section 9.2.1.  In the code below we
 * will have R=2^64 (two to the 64th power) and "q" will replace the
 * role of "N" in the summary given here.  The goal is to do
 * multiplication mod N without doing a lot of remaindering operations
 * mod N.
 *
 * DEFINITION: Given fixed integer R and N with gcd(R,N)=1, for any
 * integer z we define the "Montgomery representation of z" to be
 *  zbar := (z*R) mod N,
 * and we define the Montgomery product of two integers x,y to be
 *  M(x,y) : (x*y*R^(-1)) mod N.
 *
 * Note that if z=x*y then zbar = M(xbar,ybar).
 *
 * The basis of Montgomery's algorithm for computing M(x,y), and for
 * converting from the Montgomery representation to the usual
 * representation, is:
 * THEOREM: Given R and N as above, let Nhat = -N^(-1) mod R.
 * For any integer zbar, let
 *  Z = zbar + N*((zbar*Nhat) mod R),
 * then Z is divisible by R and z := Z/R = zbar*R^(-1) mod N.  Furthermore, if
 * 0 <= zbar < R*N then z - ((zbar*R^(-1)) mod N) is either 0 or N.
 */

/************************************************************************/

#include "gmp.h"
/* #define ASSERT(expr) to be a no-op if you don't have wfg-debug.h */
#include "wfg-debug.h"

#define FALSE 0
#define TRUE 1

/***************** Machine dependent type definitions ******************/
typedef long long int             Int64;
typedef unsigned long long int    Uns64;
typedef int                       Int32;
typedef unsigned int              Uns32;
typedef short                     Int16;
typedef unsigned short            Uns16;
typedef signed char               Int8;
typedef unsigned char             Uns8;
typedef unsigned char             Char8;
typedef unsigned char             Bool8;   /*  8 bit booleans */
/* Pointer to byte, sometimes a preferable alternate to (void *).  */
typedef char                     *BytePtr;

/* Must be >= 16 bits, but may vary for whatever is "optimal".   */
typedef unsigned int              Uns;
typedef int                       Int;     /* Preferred integer of >=16 bits.  */
typedef int                       Bool;    /* Preferred boolean.  */
/************************************************************************/

/*
 * Returns 2^n modulo q.  Must have q>1 and q odd.
 */
Uns64 pow2_mod_64bit(Uns64 n, Uns64 q)
{
  Uns64 rslt, qhat, one_rep, two_rep, tmp;
  Uns32 qhat32;
  Int bitcount;
  mp_limb_t tmp_128bit[4], tmp2_128bit[4];

  ASSERT(q>1 && (q&1)==1);
  if (n==0) {
    return 1;
  }

  /*
   * In the special case treated here, the "Montgomery representation
   * of m" means 2^64*m mod q.  one_rep and two_rep are the Montgomery
   * representations of 1 and 2, respectively.
   */
  one_rep = ((Uns64)(-1))%q + 1; /* == (2^64-1)%q + 1 == (2^64) mod q. */
  if (one_rep <= q>>1) {
    two_rep = 2*one_rep;
  }
  else {
    /* Double one_rep mod q, BARELY avoiding overflow.  */
    two_rep = one_rep + (one_rep-q);
  }

  /*
   * Use the Hensel-Newton algorithm to find qhat satisfying
   *   qhat == -1/q modulo 2^64.  -q serves as the inital (3-bit)
   * approximation to qhat.
   */
  qhat32 = -(Uns32)q;           /* 3 bits */
  qhat32 = qhat32*(2+qhat32*(Uns32)q); /* 6 bits */
  qhat32 = qhat32*(2+qhat32*(Uns32)q); /* 12 bits */
  qhat32 = qhat32*(2+qhat32*(Uns32)q); /* 24 bits */
  qhat32 = qhat32*(2+qhat32*(Uns32)q); /* 32 bits */
  qhat = qhat32*(2+qhat32*q);   /* 64 bits */
  ASSERT(q*qhat == -1);

  /*
   * Find most significant bit of n, start the squaring/doubling "power
   * ladder" loop.
   */
  bitcount = 64;
  while ((Int64)n > 0) {
    n <<= 1;
    bitcount--;
  }
  rslt = two_rep;
  while (--bitcount > 0) {
    /*
     * I'd use mpn_sqr_n(...), but there's no "public" interface to
     * that function.
     */
    mpn_mul_n(tmp_128bit, (mp_limb_t *)&rslt, (mp_limb_t *)&rslt, 2);
    /* Turn the square into the Montgomery representation mod q.  */
    tmp = qhat*(*(Uns64 *)tmp_128bit);
    mpn_mul_n(tmp2_128bit, (mp_limb_t *)&q, (mp_limb_t *)&tmp, 2);
    mpn_add_n(tmp_128bit,tmp_128bit, tmp2_128bit, 4);
    ASSERT(*(Uns64 *)tmp_128bit == 0);
    rslt = *(Uns64 *)&tmp_128bit[2];
    if (rslt >= q) {
      rslt -= q;
    }
    n <<= 1;
    if ((Int64)n < 0) {
      /* Double rslt mod q if corresponding bit is set in n.  */
      if (rslt <= q>>1) {
        rslt = 2*rslt;
      }
      else {
        rslt = rslt + (rslt-q);
      }
    }
  }
  /*
   * Finally, convert rslt from "Montgomery representation" to the
   * true value modulo q.
   */
  tmp = qhat*rslt;
  mpn_mul_n(tmp_128bit, (mp_limb_t *)&q, (mp_limb_t *)&tmp, 2);
  rslt = *(Uns64 *)&tmp_128bit[2];
  /*
   * By THEOREM, we know to add a carry to rslt if low 64-bits of
   * tmp_128bit is nonzero.
   */
  if (*(Uns64 *)&tmp_128bit[0] != 0) {
    rslt++;
  }
  /* Final reduction mod q.  */
  if (rslt >= q) {
    rslt -= q;
  }

  return rslt;
}

/*
 * The following code documents how to compute a 64-bit 2^n mod q
 * using mpz types.
 */
#if FALSE
/*
 * mpz_set_Uns64 is similar to mpz_set_ui(rslt, src).  rslt is of type
 * mpz_t, src of type Uns64.  Before calling this routine you must
 * initialize rslt via mpz_init (or some such).
 */
#define MPZ_SET_UNS64(rslt,src) {mpz_set_ui(rslt, src>>32); \
                                 mpz_mul_2exp(rslt, rslt, 32); \
                                  mpz_add_ui(rslt, rslt, (unsigned long int)src);}

/* Return 2^n modulo q using mpz functions.  */
Uns64 pow2_mod_64bit_mpz(Uns64 n, Uns64 q)
{
  mpz_t n_mpz, q_mpz, tmp_mpz, tmp2_mpz;
  Uns64 rslt;

  /* Initialize n_mpz, q_mpz, and tmp_mpz to hold 64-bit numbers (at first).  */
  mpz_init2(tmp_mpz,64);
  mpz_init2(n_mpz,64);
  MPZ_SET_UNS64(n_mpz,n);
  mpz_init2(q_mpz,64);
  MPZ_SET_UNS64(q_mpz,q);

  mpz_init_set_ui(tmp2_mpz, 2);

  /* tmp2_mpz = 2^n modulo q */
  mpz_powm(tmp2_mpz, tmp2_mpz, n_mpz, q_mpz);
  /* Convert the mpz result into "rslt".  */
  mpz_fdiv_q_2exp(tmp_mpz, tmp2_mpz, 32);
  rslt = (Uns64)mpz_get_ui(tmp2_mpz) + (((Uns64)mpz_get_ui(tmp_mpz))<<32);

  mpz_clear(n_mpz);
  mpz_clear(q_mpz);
  mpz_clear(tmp_mpz);
  mpz_clear(tmp2_mpz);
  return rslt;
}
#endif