# Generalizing Montgomery mod algorithm

Niels Möller nisse at lysator.liu.se
Mon Jun 11 21:38:14 CEST 2012

```Hi,

I think I have (re-)discovered a different schoolbook division
algorithm. I'll describe the case that we only want the remainder. I'm
fairly confident one can assemble the quotient as well, with a bit of
extra book-keeping, but I'm less sure that it can be faster than a
schoolbook division loop around 3/2 division.

Say we have an n-limb number M. Then there exists a number v such that

v M = B^{n+1} - M'

where M' is at most n limbs. There are multiple alternatives, in particular
if the top limb of M is not normalized. One choice is

v = floor [B^{n+1} / M]

This is useful if we rewrite it as

B^{n+1} = M' (mod M)

Now consider an U of n+2 limbs,

U = u_{n+1} B^{n+1} + u_n B^n + ... + u0.

Then we can reduce it by one limb using

u_{n+1} B^{n+1} = u_{n+1} M' (mod M)

where the righthand side is n+1 limbs. When adding to the other low u
limbs, we may get a carry, and in this case we'd need to add in M' one
more time (and then we can't get another carry, if I'm getting the
bounds right).

So the main loop would be something like

hi = up[--un];
while (un > mn)
{
hi = mpn_addmul_1 (up + un - mn - 1, mp, mn, hi);
ul = up[--un];
hi += ul;
if (hi < ul)
hi += mpn_add_n (up + un - mn, mp, mn);
}

Here "mp" should point to the value M' above. Not sure what the
statistics for the carry is, but it's going to become less likely the
more unnormalized the top limb of M is.

Once U is reduced to n + 1 limbs, one needs to do two more or less plain
divisions to get a final remainder in the range 0 <= R < M. The needed
inverses are intimately related to the value v computed above.

When can this be useful? Some thoughts:

* If the carry is reasonably unlikely, then I think this may be able to
beat plain schoolbook, since there are fewer and simpler adjustments.
It's looks like a left-to-right redc, but with more complex carry
handling.

* It may be attractive for a "constant time" division. Then it doesn't
really matter if the adjustment is unlikely, since we're going to need

* If we want to try to get the quotient as well, then I think we should
go for a v which is a single limb (and a fix one it). Let 2^s M be
normalized, and set v = floor [(B^{n+1} - 1) / (2^s M)]. We then get
the quotients by multiplying the sequence of top limbs in the
algorithm by (B + v), either during the loop or an additonal mul_1
after the loop.

* The multiplier v can be thought of as an (n+1)/n inverse. It can be
computed by first computing a 3/2 inverse v, then forming (B+v) * M,
and based on that one can both adjust v and compute M'.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

```