bipartite modular multiplication

Fri Nov 20 17:36:45 CET 2009

       Niels,

> From: nisse at lysator.liu.se (Niels =?iso-8859-1?Q?M=F6ller?=)
> Date: Fri, 20 Nov 2009 16:56:44 +0100
> 
> Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:
> 
> > Using the bipartite modular multiplication, one can get the upper half quotient
> > in 2M(n/2), the lower half quotient in 2M(n/2), and one can use the Karatsuba
> > trick to compute the middle correction in M(n/2), which gives a total of
> > 5M(n/2) = 1.666M(n) instead of 2M(n).
> 
> Let me see if I can get the details right...
> 
> Say we have a_0 + a_1 B + a_2 B^2 + a_2 B^3, (where B =
> 2^{n*GMP_NUMB_BITS}, i.e., n-limb chunks, n here corresponds to n/2 in
> your notation, I think). We want to reduce thtat to a number of 2n
> limbs, modulo m_0 + m_1 B, and we don't particularly care of bits are
> eliminated on the left or right side or both.
> 
> First, divide from the left,
> 
>   a_2 + a_3 B = q_1 m_1 + r_1,

right.

> and from the right,
> 
>   a_0 + a_1 B = q_0 m_0 + r_0 B

ok.

> These divisions are either school-book, divide-and-conquer, or by
> Barrett. In the latter case, they divisions take 2 M(n) each, one
> multiply for the quotient and one for the remainder.

note that in the Karatsuba range, you also get 2M(n) with the
divide-and-conquer division.

> Then we put it together, writing
> 
>   a_0 + a_1 B + a_2 B^2 + a_3 B^3 - (q_1 B + q_0) (m_1 B + m_0)
> 
>   = (r_0 + r_1 B - q_1 m_0 - q_0 m_1) B

correct.

> This looks like it needs two multiplications of size n, for a total
> computation time of 6 M(n). If I understand you correctly, you refer
> to some trick to compute this final correction using a single
> multiplication.

yes. We already know q_1 m_1 = a_2 + a_3 B - r_1, and
q_0 m_0 = a_0 + a_1 B - r_0 B, thus (q_0 + q_1) * (m_0 + m_1) minus
these two products will give q_1 m_0 + q_0 m_1.

> If we use the Kratsuba-style product (q0 + q1) (m0 + m1), we get
> 
>   r_0 + r_1 B - q_1 m_0 - q_0 m_1
>   = r_0 + r_1 B - (q_0 + q_1) (m_0 + m_1) + q_0 m_0 + q_1 m_1

you got it right.

> Now substitute q_0 m_0 = a_0 + a_1 B - r_0 B and q_1 m_1 = a_2 + a_3 B
> - r_1, to get
> 
>   r_0 + r_1 B - q_1 m_0 - q_0 m_1
>   = a_0 + a_1 B + a_2 + a_3 B + (r_1 - r_0) (B-1) - (q_0 + q_1) (m_0 + m_1)
> 
> So this should work out, as two 2n/n divisions and only one additional
> nxn multiplication, for a total cost (using Barrett) of 5 M(n).

or using the divide and conquer division.

> Compare with ordinary redc, which would take 2 M(2n) using Barrett,
> which equals 6 M(n) if multiplications are done using Karatsuba and
> ~5.5 M(n) when using Toom-3.
> 
> Do I get this right?

yes.

> I imagine this could be a win also when using schoolbook or dc for the
> divisions, as long as the linear cost of the combination is a small
> fraction of the work.

note sure for schoolbook, since then the total cost is always n^2. But clearly
for divide&conquer division.

> Do you have any feeling how this saving compares to "generalized redc"
> using the 2^n-1 instead of 2^n, as suggested by David?

I have no idea. You have to try. Anyway this will only win in a context where
we perform several modular multiplications with the same modulus, for example
in mpz_powm.

> Regards,
> /Niels

Paul