Computing A mod d (for small odd d without division and multiplication)
Torbjörn Granlund
tg at gmplib.org
Sun Mar 8 19:59:35 CET 2026
I am not sure this is new, but I don't think we have it in GMP.
We have some fast code in mpn for computing A mod d, but except for the
special mod_34lsub1, we need at least one limb multiply per limb in A.
Here is some code which just needs a load and a two-limb addition per
limb in A.
(There is a bunch of "% d" outside of the inner loop in this
implementation, but they should be using pre-inverse and then
multiplication.)
This algorithm is useful mainly for quite small d, or more precisely
when the number of limbs in A >> d. This is because the rems[] vector
will else potentially be large and expensive to compute.
There are numbers d for which B^k = 1 for any k for our common B of 2^32
and 2^64. Examples are 3, 5, 15, 17, 51, 85, 255, and 257. These will
just use rems[0], and many other d values will have a "cycle length" of
just 3.
(If we use the "nails feature of GMP, B = 2^60 is interesting, as now
lots of d give short cycles.)
#include "gmp-impl.h"
#include "longlong.h"
// Compute A = mod d where A is {ap,n}.
mp_limb_t
mpn_mod_1s (const mp_limb_t* ap, mp_size_t n, mp_limb_t d)
{
mp_limb_t rems[d];
mp_limb_t q, r, t;
r = ~(mp_limb_t)0 % d + 1; // r = B mod d (B is the limb base)
// Compute B^t mod d for 0 <= t < k where k > 1 is the smallest number for
// which B^k != 1.
t = 1;
int nrems = 0;
do {
rems[nrems] = t;
++nrems;
t = t * r % d;
} while (t != 1);
// We now have B^0 mod d = 1, B^1 mod d, ..., B^(nrems-1) mod d. in rems[].
// Sum every nrems limb into nrems sums, multiplying each sum by the
// corresponding rems[]. Sum these product-sums into the variable acc.
mp_limb_t acc = 0;
mp_limb_t hi = 0;
for (int i = 0; i < nrems; i++) {
const mp_limb_t* tp = ap + i;
mp_limb_t lo = hi;
hi = 0;
while (tp < ap + n) {
add_ssaaaa (hi, lo, hi, lo, 0, *tp);
tp += nrems;
}
acc += lo % d * rems[i];
}
acc += hi % d * rems[0];
// Now acc contains a value which is congruent to A mod d.
return acc;
}
On x86, it will be something like this:
1: add (%rdi), %rax
adc $0, %rdx
add %rsi, %rdi
dec %rcx
jnz 1b
as plain asm code. 4-way unrolling could do simply
1: add (%rdi), %rax
adc $0, %rdx
add (%rdi,%r8), %rax
adc $0, %rdx
add (%rdi,%r9), %rax
adc $0, %rdx
add (%rdi,%r10), %rax
adc $0, %rdx
add %rsi, %rdi
dec %rcx
jnz 1b
for 2 limbs/cycle on current x86 chips. It could even use SIMD, presumably..
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list