Computing A mod d (for small odd d without division and multiplication)

Torbjörn Granlund tg at gmplib.org
Sat Mar 14 13:40:30 CET 2026


I haven't had time to follow the discussion, but I am happy my
observation sparked some interest!

I suppose any inner loops will contain a few

  a0 += ptr0[...]
  a0c = carry_from_last_add
  a1 += ptr1[...]
  a1c = carry_from_last_add
  ..

where ptr0, ptr1, ...  may be the same or different.

We should accumulate into more than one variable, and not make a chain
of carry dependencies.  Even if d | B-1 we should accumulate onto 2 or
more variables.

Outside of the inner loop, things can be added together, as appropriate.

We would be able to come close to 0.5 cycles/limb with asm code for
modern x86 CPUs.

I am not super familiar with all the SIMD instructions of x86.  That is
simply too messy to keep in ones head.

But if there are SIMD 64-bit addition which also allows for the
gathering of carry-out, then performance should get close to 0.25
cycles/limb.

ARM has such instructions.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


More information about the gmp-devel mailing list