Torbjörn Granlund tg at gmplib.org
Sun May 2 22:17:05 UTC 2021

Paul Zimmermann <Paul.Zimmermann at inria.fr> writes:

yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
"Montgomery-Svoboda". The quotient selection becomes trivial, which means
one can reduce the latency between two mpn_addmul_1 calls.

It really becomes a mul_basecase except that the first round there is
done with mul_1.  If we replace that mul_1 by addmul_1, and
then called the resulting function like this,

mul_basecase'(rp, up, un, rp, rn-un)

then we will get the desired remainder at mul_basecase speed.

If we do this, I think it might make sense to have two entry points in
mul_basecase as the bulk of the functions (which is sometimes very
large) could be shared.

(I completely separate improvement would be to chain multiplication and
squaring with Montgomery reduction.)

If you reduce k limbs at a time, by precomputing m = D^(-1) mod \beta^k,
then you can use mpn_addmul_k at each step. But to reduce the last k limbs,
you need to revert to classical (Montgomery) division, since mN has n+k limbs.

The importance of addmul_k for k > 1 has decreased im the last few
years.  But sure, using a "larger" inverse is an option when