Paul Zimmermann Paul.Zimmermann at inria.fr
Sun May 2 16:38:39 UTC 2021

       Dear Torbjörn,

> IIRC, Svoboda's division trick for N/D is to find a small multiplier m
> such that, for the division mN/(mD) we have mD = 100000...XX... with the
> high limb of mD being 1000...0.
>
> This idea works also for 2-adic division.  Find m = D^(-1) mod \beta
> where \beta is the lomb base.  Then do mN/(mD) or mN mod (mD) with
> 2-adic norm.  Now mD will end in 0...0001, and the quotient limbs will
> be the low limbs of the intermediate remainder.
>
> That should mean that we could use mul_basecase directly for
> sbpi1_bdiv_r (or its sibling redc_1).
>
> Surely this is not a new observation.

yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
"Montgomery-Svoboda". The quotient selection becomes trivial, which means
one can reduce the latency between two mpn_addmul_1 calls.

If you reduce k limbs at a time, by precomputing m = D^(-1) mod \beta^k,
then you can use mpn_addmul_k at each step. But to reduce the last k limbs,
you need to revert to classical (Montgomery) division, since mN has n+k limbs.

Paul