Improved binvert algorithm
David Sparks
sparks05 at proton.me
Sun Feb 8 02:35:20 CET 2026
The standard binvert iteration is a 2-adic Newton iteration.
It turns out there's an analogue of Goldschmidt's iteration as
well, with the following similar properties:
= Same number of operations (2 multiplies, 1 add per iteration), and
+ All operations may be pipelined and performed in parallel, but
- All operations (especially the multiplies) must be full-width.
So for a processor with a non-pipelined, early-out multiplier such
as the 80386, or SPARCv7's multiply step, the current iteration is better.
But for recent x86 processors where both 32- and 64-bit multiplies
have 3-cycle latency and 1-cycle throughput, the advantage is
clearly on the other side.
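For reference, the standard 2-adic Newton iteration referred to
above can be sketched as follows (a sketch only; the table-free
initial value is used here for brevity, whereas GMP's actual
binvert_limb starts from a lookup table). Note how the second
multiply of each step depends on the first, so nothing pipelines:

```c
#include <stdint.h>

/* Standard 2-adic Newton (Hensel) iteration: each step
   r = r*(2 - n*r) doubles the number of correct low bits,
   but mul + mul + add form a serial dependency chain. */
uint64_t binvert_newton_64(uint64_t n)
{
    uint64_t r = 3*n ^ 2;  // 5 bits
    r = r * (2 - n*r);     // 10 bits
    r = r * (2 - n*r);     // 20 bits
    r = r * (2 - n*r);     // 40 bits
    r = r * (2 - n*r);     // 64 bits
    return r;
}
```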
The algorithm is from
"On Newton-Raphson iteration for multiplicative inverses modulo prime powers"
https://arxiv.org/abs/1209.6626
and
"An Improved Integer Modular Multiplicative Inverse (modulo 2^w)"
https://arxiv.org/abs/2204.04342
and works as follows:
#include <stdint.h>

__attribute__((const))
uint64_t binvert_64(uint64_t n)
{
uint64_t r = 3*n ^ 2; // 5 bits
uint64_t y = 1 - n*r;
r *= 1+y; // 10 bits
y *= y;
r *= 1+y; // 20 bits
y *= y;
r *= 1+y; // 40 bits
y *= y;
r *= 1+y; // 64 bits
return r;
}
The iteration throughput is limited by the latency of a single
multiply (3 cycles) rather than mul + mul + add (7 cycles).
Starting on the cycle when an 8-bit initial value of r is
available, things proceed:
Cycle 0: y = n*r
Cycle 3: y = 1 - y
Cycle 4: temp = 1+y, y *= y
Cycle 5: r *= temp
Cycle 7: temp = 1+y, y *= y
Cycle 8: r *= temp // 16-bit r
Cycle 10: temp = 1+y
Cycle 11: r *= temp // 32-bit r
Cycle 14: 64-bit r available
(The "temp" computed during cycles 4-5 could be computed in cycles
3-4 as "temp = 2 - y", but starting the multiply of r in cycle 4
doesn't actually help.)
If we start with a 4-bit approximation instead, one more
iteration (3 more cycles) is required to get to 64 bits. This
reduction in iteration cost might change whether a lookup table
is worthwhile.
8-bit lookup table (r = binvert_limb_table[n >> 1 & 127]):
0: temp = n >> 1
1: temp &= 127
2: r = binvert_limb_table[temp]
6: 8-bit r is available (assuming 4-cycle L1 cache hit)
4-bit computation (r = 3*n ^ 2):
0: temp = 3*n (via LEA)
1: r = temp ^ 2
2: 4-bit r is available. (The 5th bit doesn't help us.)
So as long as L1D cache latency is more than multiply latency,
computation plus one more iteration wins. 4-cycle L1 latency
seems to be standard these days (Zen4, Golden Cove, Apple M4).
Even if they're equal, L1D cache latency is a lower bound on
actual load latency, so computation is preferable.
More information about the gmp-devel mailing list