Improved binvert algorithm

David Sparks sparks05 at proton.me
Sun Feb 8 21:39:16 CET 2026


Niels Möller <nisse at lysator.liu.se> wrote:
> David Sparks sparks05 at proton.me writes: 
>> __attribute__((const))
>> uint64_t binvert_64(uint64_t n)
>> {
>>   uint64_t r = 3*n ^ 2; // 5 bits
>>   uint64_t y = 1 - n*r;
>>   r *= 1+y; // 10 bits
>>   y *= y;
>>   r *= 1+y; // 20 bits
>>   y *= y;
>>   r *= 1+y; // 40 bits
>>   y *= y;
>>   r *= 1+y; // 64 bits
>>   return r;
>> }
> 
> Interesting! To me, the clever thing is that the sequence of y values is
> computed independently of r, with only a single multiply latency. And
> the series of r-values is done in parallel, also with only a single
> multiply latency per step. And then the latency of the 1+y operation is
> only on the y --> r connection, so it is only paid once.

Yes, I thought it was very nifty when I saw it.

> Tried this out, with this replacement in gmp-impl.h:
> 
> #define binvert_limb(inv,n) \
> do { \
> mp_limb_t __n = (n); \
> mp_limb_t __inv = 3*__n ^ 2; \
> mp_limb_t __y = 1 - __n * __inv; \
> __inv *= (1 + __y); \
> __y *= __y; \
> __inv *= (1 + __y); \
> __y *= __y; \
> __inv *= (1 + __y); \
> __y *= __y; \
> __inv *= (1 + __y); \
> ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1); \
> (inv) = __inv & GMP_NUMB_MASK; \
> } while (0)
> 

> On my Ryzen 5 laptop, according to ./tune/speed -c -s1 binvert_limb,
> time is reduced from 25.35 cycles to 21.90, about 16% speedup.
> Maybe one could save a cycle or two by using only 32-bit precision
> for all but the last step?

There are two reasons why this won't work:
1) On recent x86 CPUs, 32-bit multiplication is no faster
   (it's 3-cycle latency either way), and
2) As I mentioned, all operations must be full-width to get the
   correct answer.  Like Goldschmidt's iteration, and unlike
   Newton's, it's not self-correcting.  Notice that n isn't used
   in the iteration; it enters only through y, so y must preserve
   all its bits.  Likewise, the very property that gives the
   technique its speed, y's independence from the previous value
   of r, means that all 64 bits of r must stay uncorrupted.

Now, it is possible to switch back and forth, using the pipelined
algorithm to get to 32 bits and then widening to 64 bits with the
old iteration.  For example, consider a hypothetical multiplier
with 3-cycle latency for 32-bit operations, but 5-cycle latency
for 64-bit operations.

The 64-bit pipelined implementation is no longer 4*3+2=14 cycles,
but now 4*5+2 = 22 cycles.

But it can compute a 32-bit inverse in 3*3+2 = 11 cycles, then
needs 2*5+1 = 11 more for the final Newton widening to 64 bits,
which also totals 22 cycles.  The 64-bit multiply needs a >= 2x
latency penalty before the hybrid pulls ahead.

> But it looks like the logic (starting from the binvert_limb_table) is
> duplicated in a few places, e.g., mpn/x86_64/dive_1.asm), so to adopt
> this method, there are several places to update.

The thing stopping me from preparing a patch is I don't know of
a "has pipelined multiplier" symbol to make it conditional on.
Any suggestions?

For maximum speed, we want the squaring of y to issue *before* the
inv *= 1+y multiply.  If the object code doesn't occur in that order,
maybe tweaking the source to say "tmp = 1+y; y *= y; inv *= tmp;"
will induce it to, and shave off another cycle.


