Improved binvert algorithm

Niels Möller nisse at lysator.liu.se
Sun Feb 8 20:13:06 CET 2026


David Sparks <sparks05 at proton.me> writes:

> __attribute__((const))
> uint64_t binvert_64(uint64_t n)
> {
> 	uint64_t r = 3*n ^ 2;	//  5 bits
> 	uint64_t y = 1 - n*r;
> 	r *= 1+y;		// 10 bits
> 	y *= y;
> 	r *= 1+y;		// 20 bits
> 	y *= y;
> 	r *= 1+y;		// 40 bits
> 	y *= y;
> 	r *= 1+y;		// 64 bits
> 	return r;
> }

Interesting! To me, the clever thing is that the sequence of y values is
computed independently of r, with only a single multiply latency per
step. The series of r values is computed in parallel, also with only a
single multiply latency per step. And the latency of the 1+y operation
lies only on the y --> r connection, so it is paid only once.

Tried this out, with this replacement in gmp-impl.h:

#define binvert_limb(inv,n)						\
  do {									\
    mp_limb_t  __n = (n);						\
    mp_limb_t  __inv = 3*__n ^ 2;					\
    mp_limb_t  __y = 1 - __n * __inv;					\
    __inv *= (1 + __y);							\
    __y *= __y;								\
    __inv *= (1 + __y);							\
    __y *= __y;								\
    __inv *= (1 + __y);							\
    __y *= __y;								\
    __inv *= (1 + __y);							\
    ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1);			\
    (inv) = __inv & GMP_NUMB_MASK;					\
  } while (0)

On my Ryzen 5 laptop, according to ./tune/speed -c -s1 binvert_limb,
time is reduced from 25.35 cycles to 21.90, about a 16% speedup. Maybe
one could save a cycle or two by using only 32-bit precision for all but
the last step?

But it looks like the logic (starting from the binvert_limb_table) is
duplicated in a few places, e.g., mpn/x86_64/dive_1.asm, so to adopt
this method, there are several places to update.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


More information about the gmp-devel mailing list