Computing A mod d (for small odd d without division and multiplication)

Sun Mar 8 19:59:35 CET 2026

I am not sure this is new, but I don't think we have it in GMP.

We have some fast code in mpn for computing A mod d, but except for the
special mod_34lsub1, we need at least one limb multiply per limb in A.

Here is some code which just needs a load and a two-limb addition per
limb in A.

(There is a bunch of "% d" outside of the inner loop in this
implementation, but they should be using pre-inverse and then
multiplication.)

This algorithm is useful mainly for quite small d, or more precisely
when the number of limbs in A >> d.  This is because the rems[] vector
will else potentially be large and expensive to compute.

There are numbers d for which B^k = 1 for any k for our common B of 2^32
and 2^64.  Examples are 3, 5, 15, 17, 51, 85, 255, and 257.  These will
just use rems[0], and many other d values will have a "cycle length" of
just 3.

(If we use the "nails feature of GMP, B = 2^60 is interesting, as now
lots of d give short cycles.)

#include "gmp-impl.h"
#include "longlong.h"

// Compute A = mod d where A is {ap,n}.
mp_limb_t
mpn_mod_1s (const mp_limb_t* ap, mp_size_t n, mp_limb_t d)
{
  mp_limb_t rems[d];
  mp_limb_t q, r, t;

  r = ~(mp_limb_t)0 % d + 1;	// r = B mod d  (B is the limb base)

  // Compute B^t mod d for 0 <= t < k where k > 1 is the smallest number for
  // which B^k != 1.
  t = 1;
  int nrems = 0;
  do {
    rems[nrems] = t;
    ++nrems;
    t = t * r % d;
  } while (t != 1);
  // We now have B^0 mod d = 1, B^1 mod d, ..., B^(nrems-1) mod d. in rems[].

  // Sum every nrems limb into nrems sums, multiplying each sum by the
  // corresponding rems[].  Sum these product-sums into the variable acc.
  mp_limb_t acc = 0;
  mp_limb_t hi = 0;
  for (int i = 0; i < nrems; i++) {
    const mp_limb_t* tp = ap + i;

    mp_limb_t lo = hi;
    hi = 0;
    while (tp < ap + n) {
      add_ssaaaa (hi, lo, hi, lo, 0, *tp);
      tp += nrems;
    }

    acc += lo % d * rems[i];
  }
  acc += hi % d * rems[0];

  // Now acc contains a value which is congruent to A mod d.
  return acc;
}

On x86, it will be something like this:

1: add (%rdi), %rax
   adc $0, %rdx
   add %rsi, %rdi
   dec %rcx
   jnz 1b

as plain asm code.  4-way unrolling could do simply

1: add (%rdi), %rax
   adc $0, %rdx
   add (%rdi,%r8), %rax
   adc $0, %rdx
   add (%rdi,%r9), %rax
   adc $0, %rdx
   add (%rdi,%r10), %rax
   adc $0, %rdx
   add %rsi, %rdi
   dec %rcx
   jnz 1b

for 2 limbs/cycle on current x86 chips.  It could even use SIMD, presumably..

-- 
Torbjörn
Please encrypt, key id 0xC8601622