Reason for definition of precomputed reciprocals

Thu Jun 25 16:41:50 CEST 2026

Albin Ahlbäck <ahlback at lix.polytechnique.fr> writes:

> One has
>
> 	(B + 1 + v) d = B^2 - k,
>
> where 0 <= k < d.  For division, one would have
>
> 	q = floor(n (v + B + 1) / B^2),
> 	r = m - q d,
>
> with 0 <= r < 2 d, so only if-switch would be required.

But then it's possible that r >= B, so the needed test for r >= d is not
a simple single-word comparison. The main point of our algorithms for
2/1 and 3/2 division is to only have to compute the low half of q d.
This avoids a cycle of carry propagation latency when computing the
candidate r, as well as the cost of computing the high half of the q * d
product. And instead, the "q_0" fraction word is needed to sort things
out.

That's a tradeoff that seems pretty good in the GMP setting. For other
settings, there are likely better ways. E.g., I've recently been looking
into "post-quantum" lattice-based algorithms, and need modulo for some
fixed small d's. Then I settled on using the reciprocal

  v = ceil (2^32 / d)

and I can then compute x mod d (where x is 32 bits and d is 16 bits)
using something like

  uint32_t q = ((uint64_t) v * x) >> 32 
  uint32_t r = u - q*d; /* Interpreted as two's complement, |r| < d */
  r += (r >> 16) & d;

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.