# Montgomery multiplication (was: Re: Wraparound multiplicaton vs mullo)

Niels Möller nisse at lysator.liu.se
Sun Oct 18 21:05:05 CEST 2009

```David Harvey <dmharvey at cims.nyu.edu> writes:

> The simplest version runs as follows. Suppose you want to do
> arithmetic modulo N. Select a "working modulus" M, where M > N and
> (M, N) = 1. Precompute I = -1/N mod M. Given residues 0 <= A < N and
> 0 <= B < N, do the following:
>
> C = A*B
> P = C*I mod M
> Q = C + P*N
> R = Q/M

After thinking a little more, using M = B^n - 1 with M > 2N seems like
a fairly good choice. And the division is as easy as division by B^n:
Assume Q is divisible by B^n - 1, with quotient R. Then

Q = R (B^n - 1) = R B^n - R

and it follows that R = -Q mod B^n. And by choosing n so that 0 <= R <
B^n, this is all we need. So division is for free (the negation can be
doen for free inth e computation of Q), and furthermore, Q need only
be computed mod B^n. In all, it would be

C = X * Y
P = C * (1/N) mod (B^n - 1), with 1/N mod (B^n - 1) precomputed
R = (P * N - C) mod B^n

Compared to standard Montgomery, we have multiplication mod B^n-1
rather than multiplication mod B^n when computing P, and we have mullo
rather than mulhi in the the P*N term in the computation of R. It's
somewhat annoying that there's still one mullo left, but still looks
like a clear win for all sizes where multiplication mod B^n-1 is
faster than mullo.

to it's unholy mix of arithmetic mod N and arithmetic mod B^n. Now we
can take this to the next level by also introducing a third type of
arithmetic in the same algorithm, namely mod (B^n-1)... ;-)

(One difficulty may be that we must have gcd(N, B^n-1) = 1. But if
this is not true, I guess one could use the discovered partial
factorization of N and and compute mod each factor and CRT together
the results, and that should make it faster).

Regards,
/Niels
```