Improved binvert algorithm
David Sparks
sparks05 at proton.me
Sun Feb 8 02:35:20 CET 2026
The standard binvert iteration is a 2-adic Newton iteration.
It turns out there's an analogue of Goldschmidt's iteration as
well, with the following similar properties:
= Same number of operations (2 multiplies, 1 add per iteration), and
+ All operations may be pipelined and performed in parallel, but
- All operations (especially the multiplies) must be full-width.
So for a processor with a non-pipelined, early-out multiplier such
as the 80386, or SPARCv7's multiply step, the current iteration is better.
But for recent x86 processors where both 32- and 64-bit multiplies
have 3-cycle latency and 1-cycle throughput, the advantage is
clearly on the other side.
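For reference, the standard 2-adic Newton iteration referred to
above can be sketched as follows (a sketch only; the table-free
initial value is used here for brevity, whereas GMP's actual
binvert_limb starts from a lookup table). Note how the second
multiply of each step depends on the first, so nothing pipelines:

```c
#include <stdint.h>

/* Standard 2-adic Newton (Hensel) iteration: each step
   r = r*(2 - n*r) doubles the number of correct low bits,
   but mul + mul + add form a serial dependency chain. */
uint64_t binvert_newton_64(uint64_t n)
{
    uint64_t r = 3*n ^ 2;  // 5 bits
    r = r * (2 - n*r);     // 10 bits
    r = r * (2 - n*r);     // 20 bits
    r = r * (2 - n*r);     // 40 bits
    r = r * (2 - n*r);     // 64 bits
    return r;
}
```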
The algorithm is from
"On Newton-Raphson iteration for multiplicative inverses modulo prime powers"
https://arxiv.org/abs/1209.6626
and
"An Improved Integer Modular Multiplicative Inverse (modulo 2^w)"
https://arxiv.org/abs/2204.04342
and works as follows:
#include <stdint.h>

__attribute__((const))
uint64_t binvert_64(uint64_t n)
{
uint64_t r = 3*n ^ 2; // 5 bits
uint64_t y = 1 - n*r;
r *= 1+y; // 10 bits
y *= y;
r *= 1+y; // 20 bits
y *= y;
r *= 1+y; // 40 bits
y *= y;
r *= 1+y; // 64 bits
return r;
}
The iteration throughput is limited by the latency of a single
multiply (3 cycles) rather than mul + mul + add (7 cycles).
Starting on the cycle when an 8-bit initial value of r is
available, things proceed:
Cycle 0: y = n*r
Cycle 3: y = 1 - y
Cycle 4: temp = 1+y, y *= y
Cycle 5: r *= temp
Cycle 7: temp = 1+y, y *= y
Cycle 8: r *= temp // 16-bit r
Cycle 10: temp = 1+y
Cycle 11: r *= temp // 32-bit r
Cycle 14: 64-bit r available
(The "temp" computed during cycles 4-5 could be computed in cycles
3-4 as "temp = 2 - y", but starting the multiply of r in cycle 4
doesn't actually help.)
If we start with a 4-bit approximation instead, one more
iteration (3 more cycles) is required to get to 64 bits. This
reduction in iteration cost might change whether a lookup table
is worthwhile.
8-bit lookup table (r = binvert_limb_table[n >> 1 & 127]):
0: temp = n >> 1
1: temp &= 127
2: r = binvert_limb_table[temp]
6: 8-bit r is available (assuming 4-cycle L1 cache hit)
4-bit computation (r = 3*n ^ 2):
0: temp = 3*n (via LEA)
1: r = temp ^ 2
2: 4-bit r is available. (The 5th bit doesn't help us.)
So as long as L1D cache latency is more than multiply latency,
computation plus one more iteration wins. 4-cycle L1 latency
seems to be standard these days (Zen4, Golden Cove, Apple M4).
Even if they're equal, L1D cache latency is a lower bound on
actual load latency, so computation is preferable.
More information about the gmp-devel mailing list