Reason for definition of precomputed reciprocals
Albin Ahlbäck
ahlback at lix.polytechnique.fr
Thu Jun 25 15:40:09 CEST 2026
Thanks for your answer, Niels!
On 25/06/2026 14:49, Niels Möller wrote:
> I'm not sure what you are after, do you have some improvement to the
> algorithm?
Hopefully. Not particularly for the 1-limb case as the trick I want to
use comes into play at bigger sizes. For now, the 1-limb case only has
about 8% lower latency on Zen 4, which is probably because I rely on the
pretty fast div instruction that's on modern CPUs.
For the 4-limb case, my routine has ~39% lower latency compared to
invert_pi1 + sbpi1_div_q on Zen 4. But my routines are hardcoded with
respect to the size, so some gain is definitely expected from this.
> Using v = floor((B^2 - 1) / d) - B implies that
>
> (B + v) d = B^2 - k, with 1 <= k <= d
> ...> With the alternative smaller reciprocal you suggest, it seems that
> instead one would get
>
> (B + v) d = B^2 - k with d <= k < 2d
One has
(B + 1 + v) d = B^2 - k,
where 0 <= k < d. For division, one would have
q = floor(n (v + B + 1) / B^2),
r = m - q d,
with 0 <= r < 2 d, so only if-switch would be required. I'm sure that
your division with precomputed reciprocal is faster, I just wanted to
understand your perspective on it.
Best,
Albin
More information about the gmp-devel
mailing list