Computing A mod d (for small odd d without division and multiplication)
Torbjörn Granlund
tg at gmplib.org
Sat Mar 14 13:40:30 CET 2026
I haven't had time to follow the discussion, but I am happy my
observation sparked some interest!
I suppose any inner loops will contain a few
a0 += ptr0[...]
a0c = carry_from_last_add
a1 += ptr1[...]
a1c = carry_from_last_add
..
where ptr0, ptr1, ... may be the same or different.
We should accumulate into more than one variable, and not make a chain
of carry dependencies. Even if d | B-1 we should accumulate onto 2 or
more variables.
Outside of the inner loop, things can be added together, as appropriate.
We would be able to come close to 0.5 cycles/limb with asm code for
modern x86 CPUs.
I am not super familiar with all the SIMD instructions of x86. That is
simply too messy to keep in ones head.
But if there are SIMD 64-bit addition which also allows for the
gathering of carry-out, then performance should get close to 0.25
cycles/limb.
ARM has such instructions.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list