Computing A mod d (for small odd d without division and multiplication)
marco.bodrato at tutanota.com
marco.bodrato at tutanota.com
Sun Mar 15 14:30:02 CET 2026
Ciao Torbjörn,
14 mar 2026, 13:40 da tg at gmplib.org:
> I suppose any inner loops will contain a few
>
Actually, I did not touch the inner loop, I just simplified the outer one,
removing the unneeded rems[] array, and the unnecessary acc variable.
> We should accumulate into more than one variable, and not make a chain
> of carry dependencies. Even if d | B-1 we should accumulate onto 2 or
> more variables.
>
But since you asked, I tried the same code with two couples: hi, lo and hi2, lo2.
I attach a new code for measurements.
> We would be able to come close to 0.5 cycles/limb with asm code for
> modern x86 CPUs.
>
My assembly skills are rusty, and I don't have a very modern CPU,
but even the C code is fast enough to be faster than the current in some ranges.
When I sent the previous message, my "tune/speed" measurements where a bit
unfair, because I forgot to --disable-assert. So, here are some new numbers.
Let B=2^64,
Modulo 1023, we have B^5 = 1 mod 1023, a small order.
The new code is faster in a wide range.
n=1023; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.90 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
mpn_mod_1.1023 mpn_mod_1_a.1023 mpn_mod_1_b.1023
1 #41.1492 66.0706 212.5748
4 #21.3133 22.5360 54.3853
16 9.5578 #6.5047 13.7410
64 4.3763 #2.4725 4.4070
256 2.9760 #1.4629 1.9286
1024 2.6120 #1.2309 1.3610
4096 2.5214 #1.1956 1.2055
16384 2.4777 1.4472 #1.4380
65536 2.4924 #2.3086 2.3119
262144 2.4999 2.4564 #2.3847
1048576 #2.9904 4.7825 5.5658
Modulo 1019, we have B^509 = 1 mod 1019, a large order, but pre-computed.The new code is faster in a medium range.
$ n=1019; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 5.31 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.60 MHz
mpn_mod_1.1019 mpn_mod_1_a.1019 mpn_mod_1_b.1019
1 #65.4586 66.6597 20567.0892
4 #21.2112 22.5479 5122.3567
16 #9.5341 11.1123 1282.5043
64 #4.3281 8.9313 321.9570
256 #2.9928 8.4267 80.5561
1024 #2.5961 4.3699 20.6234
4096 2.5298 #1.7306 5.8876
16384 2.5030 #1.3313 2.4068
65536 2.4819 #2.0853 2.3196
262144 #2.5140 2.9740 2.8772
1048576 #2.9358 13.1306 13.3467
Modulo ,274177 we have B^2 = 1 mod 274177, a tiny order, not pre-computed.
Thanks to the 2-fold inner loop, one pass is enough, new code is slower only for very short operands.
n=274177; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.89 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
mpn_mod_1.274177 mpn_mod_1_a.274177 mpn_mod_1_b.274177
1 #41.4621 84.2615 112.6924
4 #21.1437 24.5606 28.6847
16 9.5385 #7.2031 7.8009
64 4.3266 #2.5133 2.7900
256 2.9750 #1.4814 1.5401
1024 2.5947 #1.1485 1.1647
4096 2.5020 #1.0829 1.0844
16384 2.4949 1.1732 #1.1730
65536 2.4782 1.3068 #1.3067
262144 2.5215 1.3889 #1.3265
1048576 2.9439 #1.8079 1.8437
Modulo ,274173 we have B^11223 = 1 mod 274173, a huge order, not pre-computed.
n=274173; build/tune/speed-ext -p10000000 -C -s 1-2000000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.90 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
mpn_mod_1.274173 mpn_mod_1_a.274173 mpn_mod_1_b.274173
1 #41.6464 399910.2857 450893.7600
4 #21.0509 100079.7143 112613.4400
16 #9.5381 24984.6786 28120.3900
64 #4.3558 6317.3616 7047.6125
256 #2.9755 1569.1378 1758.5638
1024 #2.6136 398.5261 440.2130
4096 #2.5164 106.4404 111.6505
16384 #2.5371 30.2224 28.1599
65536 #2.4840 8.1727 7.8289
262144 #2.5287 3.0757 3.0168
1048576 #2.9270 6.6199 6.3925
The time needed to initialize the computation, and the effect of cache missis change a lot
for different bases, not far from one another. Not only to use this strategy we have to write
an efficient inner-loop, but we also have to think how to handle "thresholds"...
Moreover, I did not update the ASSERT, for the mpn_mod_1_a variant, my implementation assumes that r*r*r fits in a limb.
> But if there are SIMD 64-bit addition which also allows for the
> gathering of carry-out, then performance should get close to 0.25
> cycles/limb.
>
> ARM has such instructions.
>
Does ARM have SIMD 64-bits addition with carry? Really? Interesting!
More information about the gmp-devel
mailing list