Improved binvert algorithm

Mon Feb 9 20:09:18 CET 2026

David Sparks <sparks05 at proton.me> writes:

> The thing stopping me from preparing a patch is I don't know of
> a "has pipelined multiplier" symbol to make it conditional on.
> Any suggestions?

For the generic C code, what one typically does in GMP is implement both
methods, use a

#define BINVERT_LIMB_METHOD ...

to select the method to use, and let the speed/tuneup program compare
them and produce the appropriate define for each of the gmp-mparam.h
files.

One can then also have a look at the threshold tables to see which
method is selected in each configuration, see, e.g.,
https://gmplib.org/devel/thres/DIV_QR_1N_PI1_METHOD. 

For assembly files that inline the iteration, one typically implements
only the single variant that seems most appropriate for that
architecture and cpu flavor.

To get an idea of the potential impact, maybe it would be a good first
step to update the one or two assembly files where potential speedup
seems largest?

> For maximum speed, we want the squaring of y to issue *before* the
> inv *= 1+y multiply.  If the object code doesn't occur in that order,
> maybe tweaking the source to say "tmp = 1+y; y *= y; inv *= tmp;"

On machines with out-of-order processing, I would expect the
ooo-hardware to make up for sligthly suboptimal instruction order, as
long as dependencies are right. But scheduling can have measurable
impact. But, in particular on x86_64 processors, it seems pretty hard to
predict the effect of manual scheduling. To get highest speed one often
have to try several variants (something I rarely have sufficient time
and patience to do well).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.