Computing A mod d (for small odd d without division and multiplication)

marco.bodrato at tutanota.com marco.bodrato at tutanota.com
Sat Mar 14 11:18:11 CET 2026


Ciao,

9 mar 2026, 17:58 da marco.bodrato at tutanota.com:

> Do we need all the value in rems[]? I believe we don't.
>
> Instead of accumulating in acc, we can multiply lo by the inverse of B mod d,
> and add hi to it, then we can keep on accumulating.
> At the end of the cycle, we will have in [hi,lo] a number equivalent to A mod d.
>

We can also avoid to compute the inverse of B mod b, and directly use the residue
of B mod b. We just have to add up the lines by the reverse order.

I tested both approaches and the return value is correct. Is it fast?

I know that using tune/speed-ext.c is not the best way to check the actual speed.
but it's the easiest :-) That's why I attach a file with the two functions that can
replace that file...

Everything depends on the (multiplicative) order of B in the ring Z/bZ.
If the order is 1, then the residue of B mod b is 1, and its inverse is 1
(i.e. b divides B-1) and both codes are quite fast, and of course one could
write an even more specialized code for that case.

An example, if b=17:
$ n=17; build/tune/speed-ext -p10000000 -C -s 1-1700000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.89 cycles, precision 10000000 units of 7.71e-10 secs, CPU freq 1297.01 MHz
         mpn_mod_1.17 mpn_mod_1_a.17 mpn_mod_1_b.17
1            #64.7109      144.3273      167.0704
4            #20.9006       37.0905       42.7984
16            #9.5543       10.6102       11.8468
64             4.3204       #4.2093        4.3162
256            2.9799        2.1959       #2.0831
1024           2.5893        1.6899       #1.4976
4096           2.5008        1.5674       #1.3904
16384          2.4772        1.5701       #1.3881
65536          2.4775        1.6103       #1.4673
262144         2.5024        1.6479       #1.5027
1048576         2.9615        2.1354       #2.0525

If the base is small, so that the order is small, we can look-up the order from a table,
and it is easy to compute the residue of B mod b, to compute its inverse we need
order multiplications and reductions (asymptotically we'd use gcdext, but here we don't).

Assume order is small but not tiny, e.g. if b=417, the order is 69,
computing the inverse is heavy, so that the code not needing it is faster
in a range of lengths.
But there one more problem, swiping the operand many times is not efficient,
if the operand does not fit in the cache! 
$ n=417; build/tune/speed-ext -p10000000 -C -s 1-1700000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.89 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.00 MHz
        mpn_mod_1.417 mpn_mod_1_a.417 mpn_mod_1_b.417
1            #64.8495      146.5563     2822.5922
4            #20.9894       42.9108      714.6674
16            #9.5659       17.4111      180.0995
64            #4.3715       12.2414       45.2863
256           #2.9932        4.2965       12.2776
1024           2.5905       #2.5338        4.0825
4096           2.5009       #2.2388        2.4175
16384          2.4773        1.7700       #1.6322
65536         #2.4769        2.7296        3.0141
262144        #2.5216        4.0898        4.2677
1048576        #2.9732        8.0419        9.4231

If the order is tiny, even if it is not exactly 1, we need to run over the operand only a few times.
The effect of cache misses is smaller, but at last it is heavy anyway.

An example, if b=274177, the order is 2:
$ n=274177; build/tune/speed-ext -p10000000 -C -s 1-1700000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.91 cycles, precision 10000000 units of 7.71e-10 secs, CPU freq 1296.95 MHz
        mpn_mod_1.274177 mpn_mod_1_a.274177 mpn_mod_1_b.274177
1            #64.9447      169.8672      266.9416
4            #27.8788       45.2729       50.4220
16            #9.6185       12.4918       13.5013
64            #4.3203        4.6472        5.0284
256            2.9716        2.4373       #2.3494
1024           2.5923        1.7532       #1.5798
4096           2.5010        1.5898       #1.4298
16384          2.5057        1.7232       #1.5530
65536          2.4807        2.0696       #2.0327
262144         2.5032        2.1419       #2.0908
1048576        #2.9749        3.3043        3.2164

On the other side , if b=274175, the order is 415, and the new codes are both inefficient:
$ n=274175; build/tune/speed-ext -p10000000 -C -s 1-1700000 -f4 mpn_mod_1.$n mpn_mod_1_a.$n mpn_mod_1_b.$n
overhead 4.89 cycles, precision 10000000 units of 5.28e-10 secs, CPU freq 1895.61 MHz
        mpn_mod_1.274175 mpn_mod_1_a.274175 mpn_mod_1_b.274175
1            #64.7279    14927.8659    16765.8239
4            #20.9373     3736.0827     4192.1050
16            #9.5638      941.4242     1051.1517
64            #4.3743      243.4748      263.8422
256           #2.9712       68.3601       66.6708
1024          #2.6175       19.7176       17.5073
4096          #2.5028        6.2934        5.2420
16384         #2.4967        3.4427        2.8903
65536         #2.4779        2.7415        2.7957
262144        #2.4965        3.3008        3.6258
1048576        #2.9634       20.7607       20.7154

Ĝis,
mb
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed-ext.c
Type: text/x-csrc
Size: 12785 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20260314/153d483c/attachment-0001.bin>


More information about the gmp-devel mailing list