Reason for definition of precomputed reciprocals

Thu Jun 25 15:40:09 CEST 2026

Thanks for your answer, Niels!

On 25/06/2026 14:49, Niels Möller wrote:
> I'm not sure what you are after, do you have some improvement to the
> algorithm?

Hopefully.  Not particularly for the 1-limb case as the trick I want to 
use comes into play at bigger sizes.  For now, the 1-limb case only has 
about 8% lower latency on Zen 4, which is probably because I rely on the 
pretty fast div instruction that's on modern CPUs.

For the 4-limb case, my routine has ~39% lower latency compared to 
invert_pi1 + sbpi1_div_q on Zen 4.  But my routines are hardcoded with 
respect to the size, so some gain is definitely expected from this.

> Using v = floor((B^2 - 1) / d) - B implies that
> 
>   (B + v) d = B^2 - k, with 1 <= k <= d
> ...> With the alternative smaller reciprocal you suggest, it seems that
> instead one would get
> 
>    (B + v) d = B^2 - k with d <= k < 2d
One has

	(B + 1 + v) d = B^2 - k,

where 0 <= k < d.  For division, one would have

	q = floor(n (v + B + 1) / B^2),
	r = m - q d,

with 0 <= r < 2 d, so only if-switch would be required.  I'm sure that 
your division with precomputed reciprocal is faster, I just wanted to 
understand your perspective on it.

Best,
Albin