Improved binvert algorithm
Niels Möller
nisse at lysator.liu.se
Sun Feb 8 20:13:06 CET 2026
David Sparks <sparks05 at proton.me> writes:
> __attribute__((const))
> uint64_t binvert_64(uint64_t n)
> {
>     uint64_t r = 3*n ^ 2; // 5 bits
>     uint64_t y = 1 - n*r;
>     r *= 1+y; // 10 bits
>     y *= y;
>     r *= 1+y; // 20 bits
>     y *= y;
>     r *= 1+y; // 40 bits
>     y *= y;
>     r *= 1+y; // 64 bits
>     return r;
> }
Interesting! To me, the clever thing is that the sequence of y values is
computed independently of r, with only a single multiply latency. And
the series of r-values is done in parallel, also with only a single
multiply latency per step. And then the latency of the 1+y operation is
only on the y --> r connection, so it is only paid once.
Tried this out, with this replacement in gmp-impl.h:
#define binvert_limb(inv,n)                             \
  do {                                                  \
    mp_limb_t __n = (n);                                \
    mp_limb_t __inv = 3*__n ^ 2;                        \
    mp_limb_t __y = 1 - __n * __inv;                    \
    __inv *= (1 + __y);                                 \
    __y *= __y;                                         \
    __inv *= (1 + __y);                                 \
    __y *= __y;                                         \
    __inv *= (1 + __y);                                 \
    __y *= __y;                                         \
    __inv *= (1 + __y);                                 \
    ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1);        \
    (inv) = __inv & GMP_NUMB_MASK;                      \
  } while (0)
On my Ryzen 5 laptop, according to ./tune/speed -c -s1 binvert_limb,
time is reduced from 25.35 cycles to 21.90, about a 16% speedup. Maybe
one could save a cycle or two by using only 32-bit precision for all
but the last step?
But it looks like the logic (starting from the binvert_limb_table) is
duplicated in a few places, e.g., mpn/x86_64/dive_1.asm, so to adopt
this method, there are several places to update.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.