Computing A mod d (for small odd d without division and multiplication)

marco.bodrato at tutanota.com marco.bodrato at tutanota.com
Sun Mar 15 14:30:02 CET 2026


Ciao Torbjörn,

14 mar 2026, 13:40 da tg at gmplib.org:

> I suppose any inner loops will contain a few
>
Actually, I did not touch the inner loop, I just simplified the outer one,
removing the unneeded rems[] array, and the unnecessary acc variable.

> We should accumulate into more than one variable, and not make a chain
> of carry dependencies.  Even if d | B-1 we should accumulate onto 2 or
> more variables.
>
But since you asked, I tried the same code with two couples: hi, lo and hi2, lo2.
I attach a new code for measurements.
> We would be able to come close to 0.5 cycles/limb with asm code for
> modern x86 CPUs.
>
My assembly skills are rusty, and I don't have a very modern CPU,
but even the C code is fast enough to be faster than the current in some ranges.

When I sent the previous message, my "tune/speed" measurements where a bit
unfair, because I forgot to --disable-assert. So, here are some new numbers.
Let B=2^64,
Modulo 1023, we have B^5 = 1 mod 1023, a small order.
The new code is faster in a wide range.
n=1023; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.90 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
        mpn_mod_1.1023 mpn_mod_1_a.1023 mpn_mod_1_b.1023
1            #41.1492       66.0706      212.5748
4            #21.3133       22.5360       54.3853
16             9.5578       #6.5047       13.7410
64             4.3763       #2.4725        4.4070
256            2.9760       #1.4629        1.9286
1024           2.6120       #1.2309        1.3610
4096           2.5214       #1.1956        1.2055
16384          2.4777        1.4472       #1.4380
65536          2.4924       #2.3086        2.3119
262144         2.4999        2.4564       #2.3847
1048576        #2.9904        4.7825        5.5658

Modulo 1019, we have B^509 = 1 mod 1019, a large order, but pre-computed.The new code is faster in a medium range.
$ n=1019; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 5.31 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.60 MHz
        mpn_mod_1.1019 mpn_mod_1_a.1019 mpn_mod_1_b.1019
1            #65.4586       66.6597    20567.0892
4            #21.2112       22.5479     5122.3567
16            #9.5341       11.1123     1282.5043
64            #4.3281        8.9313      321.9570
256           #2.9928        8.4267       80.5561
1024          #2.5961        4.3699       20.6234
4096           2.5298       #1.7306        5.8876
16384          2.5030       #1.3313        2.4068
65536          2.4819       #2.0853        2.3196
262144        #2.5140        2.9740        2.8772
1048576        #2.9358       13.1306       13.3467

Modulo ,274177 we have B^2 = 1 mod 274177, a tiny order, not pre-computed.
Thanks to the 2-fold inner loop, one pass is enough, new code is slower only for very short operands.
n=274177; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.89 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
        mpn_mod_1.274177 mpn_mod_1_a.274177 mpn_mod_1_b.274177
1            #41.4621       84.2615      112.6924
4            #21.1437       24.5606       28.6847
16             9.5385       #7.2031        7.8009
64             4.3266       #2.5133        2.7900
256            2.9750       #1.4814        1.5401
1024           2.5947       #1.1485        1.1647
4096           2.5020       #1.0829        1.0844
16384          2.4949        1.1732       #1.1730
65536          2.4782        1.3068       #1.3067
262144         2.5215        1.3889       #1.3265
1048576         2.9439       #1.8079        1.8437

Modulo ,274173 we have B^11223 = 1 mod 274173, a huge order, not pre-computed.

n=274173; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.90 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
        mpn_mod_1.274173 mpn_mod_1_a.274173 mpn_mod_1_b.274173
1            #41.6464   399910.2857   450893.7600
4            #21.0509   100079.7143   112613.4400
16            #9.5381    24984.6786    28120.3900
64            #4.3558     6317.3616     7047.6125
256           #2.9755     1569.1378     1758.5638
1024          #2.6136      398.5261      440.2130
4096          #2.5164      106.4404      111.6505
16384         #2.5371       30.2224       28.1599
65536         #2.4840        8.1727        7.8289
262144        #2.5287        3.0757        3.0168
1048576        #2.9270        6.6199        6.3925

The time needed to initialize the computation, and the effect of cache missis change a lot
for different bases, not far from one another. Not only to use this strategy we have to write
an efficient inner-loop, but we also have to think how to handle "thresholds"...

Moreover, I did not update the ASSERT, for the mpn_mod_1_a variant, my implementation assumes that r*r*r fits in a limb.
> But if there are SIMD 64-bit addition which also allows for the
> gathering of carry-out, then performance should get close to 0.25
> cycles/limb.
>
> ARM has such instructions.
>
Does ARM have SIMD 64-bits addition with carry? Really? Interesting!


More information about the gmp-devel mailing list