Improved binvert algorithm

Fri Mar 20 06:38:24 CET 2026

On Thursday, March 19th, 2026 at 22:47, marco.bodrato at tutanota.com <marco.bodrato at tutanota.com> wrote:
> 9 feb 2026, 20:58 da sparks05 at proton.me:
>> For maximum speed, we want the squaring of y to issue *before* the
>> inv *= 1+y multiply. If the object code doesn't occur in that order,
>> maybe tweaking the source to say "tmp = 1+y; y *= y; inv *= tmp;"
>> will induce it to, and shave off another cycle.
> 
> When we use assembly, then scheduling is our job, when we write in C it's
> a job for the compiler, isn't it?

Actually, it's the job of the out-of-order instruction scheduler.
Object-code order is only used as a tie-breaker.

The issue is that both computations have the same prerequisite, y,
and there is usually only one multiplier, so on the cycle it
becomes available, the hardware scheduler has a choice.  Hopefully,
it'll issue "tmp = y+1" and "y *= y" on one cycle, and then
"inv *= tmp" will issue the cycle after (assuming a pipelines
multiplier which can issue once per cycle).

> Moreover, if I remember correctly, the ARM processors have a
> MuLtiplyAccumulate instruction that can compute A * (B+1) without the
> need to add 1 to B (one computes A + B * A).

You're absolutely right!  And furthermore, GCC figures it out.

This source code:
#include <stdint.h>
uint64_t inv4(uint64_t n)
{
	uint64_t r = 3*n ^ 2;	//  5 bits
	uint64_t y = 1 - n*r;
	r *= 1+y;		// 10 bits
	y *= y;
	r *= 1+y;		// 20 bits
	y *= y;
	r *= 1+y;		// 40 bits
	y *= y;
	r *= 1+y;		// 64 bits
	return r;
}

uint64_t inv5(uint64_t n)
{
	uint64_t r = 3*n ^ 2;	//  5 bits
	uint64_t y0 = 1 - n*r, y1 =  y0*y0, y2 = y1*y1, y3 = y2*y2;
	r *= 1+y0;		// 10 bits
	r *= 1+y1;		// 20 bits
	r *= 1+y2;		// 40 bits
	r *= 1+y3;		// 64 bits
	return r;
}
can be fed to GCC-ARM64 to produce
# Compilation provided by Compiler Explorer at https://godbolt.org/
inv4(unsigned long):
        add     x4, x0, x0, lsl 1
        mov     x2, 1
        eor     x4, x4, 2
        mul     x3, x0, x4
        mov     x0, 2
        sub     x2, x2, x3
        sub     x0, x0, x3
        mul     x2, x2, x2
        mul     x0, x0, x4
        mul     x1, x2, x2
        mul     x5, x1, x1
        add     x1, x1, 1
        madd    x1, x2, x1, x1
        madd    x0, x5, x0, x0
        mul     x0, x1, x0
        ret
inv5(unsigned long):
        add     x4, x0, x0, lsl 1
        mov     x2, 1
        eor     x4, x4, 2
        mul     x3, x0, x4
        mov     x0, 2
        sub     x2, x2, x3
        sub     x0, x0, x3
        mul     x2, x2, x2
        mul     x0, x0, x4
        mul     x1, x2, x2
        mul     x5, x1, x1
        add     x1, x1, 1
        madd    x1, x2, x1, x1
        madd    x0, x5, x0, x0
        mul     x0, x1, x0
        ret
As you can see, the generated asm uses madd, and is identical.
Godbolt link:
https://godbolt.org//#eyJzZXNzaW9ucyI6W3siaWQiOjEsImxhbmd1YWdlIjoiYysrIiwic291cmNlIjoiI2luY2x1ZGUgPHN0ZGludC5oPlxyXG51aW50NjRfdCBpbnY0KHVpbnQ2NF90IG4pXHJcbntcclxuXHR1aW50NjRfdCByID0gMypuIF4gMjtcdC8vICA1IGJpdHNcclxuXHR1aW50NjRfdCB5ID0gMSAtIG4qcjtcclxuXHRyICo9IDEreTtcdFx0Ly8gMTAgYml0c1xyXG5cdHkgKj0geTtcclxuXHRyICo9IDEreTtcdFx0Ly8gMjAgYml0c1xyXG5cdHkgKj0geTtcclxuXHRyICo9IDEreTtcdFx0Ly8gNDAgYml0c1xyXG5cdHkgKj0geTtcclxuXHRyICo9IDEreTtcdFx0Ly8gNjQgYml0c1xyXG5cdHJldHVybiByO1xyXG59XHJcblxyXG51aW50NjRfdCBpbnY1KHVpbnQ2NF90IG4pXHJcbntcclxuXHR1aW50NjRfdCByID0gMypuIF4gMjtcdC8vICA1IGJpdHNcclxuXHR1aW50NjRfdCB5MCA9IDEgLSBuKnIsIHkxID0gIHkwKnkwLCB5MiA9IHkxKnkxLCB5MyA9IHkyKnkyO1xyXG5cdHIgKj0gMSt5MDtcdFx0Ly8gMTAgYml0c1xyXG5cdHIgKj0gMSt5MTtcdFx0Ly8gMjAgYml0c1xyXG5cdHIgKj0gMSt5MjtcdFx0Ly8gNDAgYml0c1xyXG5cdHIgKj0gMSt5MztcdFx0Ly8gNjQgYml0c1xyXG5cdHJldHVybiByO1xyXG59XHJcbiIsImNvbmZvcm1hbmNldmlldyI6ZmFsc2UsImNvbXBpbGVycyI6W3siX2ludGVybmFsaWQiOjEsImlkIjoiYXJtNjRnMTUyMCIsIm9wdGlvbnMiOiItTzIiLCJmaWx0ZXJzIjp7ImJpbmFyeSI6ZmFsc2UsImJpbmFyeU9iamVjdCI6ZmFsc2UsImNvbW1lbnRPbmx5Ijp0cnVlLCJkZW1hbmdsZSI6dHJ1ZSwiZGlyZWN0aXZlcyI6dHJ1ZSwiZXhlY3V0ZSI6ZmFsc2UsImludGVsIjp0cnVlLCJsYWJlbHMiOnRydWUsImxpYnJhcnlDb2RlIjpmYWxzZSwidHJpbSI6ZmFsc2UsImRlYnVnQ2FsbHMiOmZhbHNlfSwibGlicyI6W10sInNwZWNpYWxvdXRwdXRzIjpbXSwidG9vbHMiOltdLCJvdmVycmlkZXMiOltdfV0sImV4ZWN1dG9ycyI6W119XSwidHJlZXMiOltdfQ

On x86-64, the 
> Then, I'd declare as many variables y as needed and let the optimizer
> decide. Something like:

Nice idea, but it appears that both of our concerns were unfounded.
GCC sees the dependency chain and schedules the squarings first anyway.

On x86-64, there are some minor differences between the two, mostly in register assignment, but nothing to worry about.

I'm much less concerned about squarings after the first, as the front
end will have filled the issue window with decoded instructions during
the latency of the initial multiplies, even in the cold-cache case.

(And Zen5 has three full integer multipliers, so the issue becomes
moot.  This is likely not because it's needed, but because it simplifies
the instruction-dispatch logic, which is on the critical path.)