bipartite modular multiplication

Niels Möller nisse at lysator.liu.se
Fri Nov 20 16:56:44 CET 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

> Using the bipartite modular multiplication, one can get the upper half quotient
> in 2M(n/2), the lower half quotient in 2M(n/2), and one can use the Karatsuba
> trick to compute the middle correction in M(n/2), which gives a total of
> 5M(n/2) = 1.666M(n) instead of 2M(n).

Let me see if I can get the details right...

Say we have a_0 + a_1 B + a_2 B^2 + a_2 B^3, (where B =
2^{n*GMP_NUMB_BITS}, i.e., n-limb chunks, n here corresponds to n/2 in
your notation, I think). We want to reduce thtat to a number of 2n
limbs, modulo m_0 + m_1 B, and we don't particularly care of bits are
eliminated on the left or right side or both.

First, divide from the left,

a_2 + a_3 B = q_1 m_1 + r_1,

and from the right,

a_0 + a_1 B = q_0 m_0 + r_0 B

These divisions are either school-book, divide-and-conquer, or by
Barrett. In the latter case, they divisions take 2 M(n) each, one
multiply for the quotient and one for the remainder.

Then we put it together, writing

a_0 + a_1 B + a_2 B^2 + a_3 B^3 - (q_1 B + q_0) (m_1 B + m_0)

= (r_0 + r_1 B - q_1 m_0 - q_0 m_1) B

This looks like it needs two multiplications of size n, for a total
computation time of 6 M(n). If I understand you correctly, you refer
to some trick to compute this final correction using a single
multiplication.

If we use the Kratsuba-style product (q0 + q1) (m0 + m1), we get

r_0 + r_1 B - q_1 m_0 - q_0 m_1
= r_0 + r_1 B - (q_0 + q_1) (m_0 + m_1) + q_0 m_0 + q_1 m_1

Now substitute q_0 m_0 = a_0 + a_1 B - r_0 B and q_1 m_1 = a_2 + a_3 B
- r_1, to get

r_0 + r_1 B - q_1 m_0 - q_0 m_1
= a_0 + a_1 B + a_2 + a_3 B + (r_1 - r_0) (B-1) - (q_0 + q_1) (m_0 + m_1)

So this should work out, as two 2n/n divisions and only one additional
nxn multiplication, for a total cost (using Barrett) of 5 M(n).

Compare with ordinary redc, which would take 2 M(2n) using Barrett,
which equals 6 M(n) if multiplications are done using Karatsuba and
~5.5 M(n) when using Toom-3.

Do I get this right?

I imagine this could be a win also when using schoolbook or dc for the
divisions, as long as the linear cost of the combination is a small
fraction of the work.

Do you have any feeling how this saving compares to "generalized redc"
using the 2^n-1 instead of 2^n, as suggested by David?

Regards,
/Niels