bipartite modular multiplication
Niels Möller
nisse at lysator.liu.se
Fri Nov 20 16:56:44 CET 2009
Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:
> Using the bipartite modular multiplication, one can get the upper half quotient
> in 2M(n/2), the lower half quotient in 2M(n/2), and one can use the Karatsuba
> trick to compute the middle correction in M(n/2), which gives a total of
> 5M(n/2) = 1.666M(n) instead of 2M(n).
Let me see if I can get the details right...
Say we have a_0 + a_1 B + a_2 B^2 + a_2 B^3, (where B =
2^{n*GMP_NUMB_BITS}, i.e., n-limb chunks, n here corresponds to n/2 in
your notation, I think). We want to reduce thtat to a number of 2n
limbs, modulo m_0 + m_1 B, and we don't particularly care of bits are
eliminated on the left or right side or both.
First, divide from the left,
a_2 + a_3 B = q_1 m_1 + r_1,
and from the right,
a_0 + a_1 B = q_0 m_0 + r_0 B
These divisions are either school-book, divide-and-conquer, or by
Barrett. In the latter case, they divisions take 2 M(n) each, one
multiply for the quotient and one for the remainder.
Then we put it together, writing
a_0 + a_1 B + a_2 B^2 + a_3 B^3 - (q_1 B + q_0) (m_1 B + m_0)
= (r_0 + r_1 B - q_1 m_0 - q_0 m_1) B
This looks like it needs two multiplications of size n, for a total
computation time of 6 M(n). If I understand you correctly, you refer
to some trick to compute this final correction using a single
multiplication.
If we use the Kratsuba-style product (q0 + q1) (m0 + m1), we get
r_0 + r_1 B - q_1 m_0 - q_0 m_1
= r_0 + r_1 B - (q_0 + q_1) (m_0 + m_1) + q_0 m_0 + q_1 m_1
Now substitute q_0 m_0 = a_0 + a_1 B - r_0 B and q_1 m_1 = a_2 + a_3 B
- r_1, to get
r_0 + r_1 B - q_1 m_0 - q_0 m_1
= a_0 + a_1 B + a_2 + a_3 B + (r_1 - r_0) (B-1) - (q_0 + q_1) (m_0 + m_1)
So this should work out, as two 2n/n divisions and only one additional
nxn multiplication, for a total cost (using Barrett) of 5 M(n).
Compare with ordinary redc, which would take 2 M(2n) using Barrett,
which equals 6 M(n) if multiplications are done using Karatsuba and
~5.5 M(n) when using Toom-3.
Do I get this right?
I imagine this could be a win also when using schoolbook or dc for the
divisions, as long as the linear cost of the combination is a small
fraction of the work.
Do you have any feeling how this saving compares to "generalized redc"
using the 2^n-1 instead of 2^n, as suggested by David?
Regards,
/Niels
More information about the gmp-devel
mailing list