Paul.Zimmermann at inria.fr
Mon May 3 06:31:46 UTC 2021
> yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
> "Montgomery-Svoboda". The quotient selection becomes trivial, which means
> one can reduce the latency between two mpn_addmul_1 calls.
> It really becomes a mul_basecase except that the first round there is
> done with mul_1. If we replace that mul_1 by addmul_1, and
> then called the resulting function like this,
> mul_basecase'(rp, up, un, rp, rn-un)
> then we will get the desired remainder at mul_basecase speed.
a slight difference is that here the limb multiplier is not known in advance,
but only at the beginning of each loop. But maybe it does not make any
I tried to implement Montgomery-Svoboda at the C level, but did not manage
to beat the mpn_redc_x routines. I'm very interested to see your results!
More information about the gmp-devel