bipartite modular multiplication

Paul Zimmermann Paul.Zimmermann at loria.fr
Wed Apr 29 09:18:24 CEST 2009


       Hi,

the bipartite modular multiplication from [KaTa08] below (I would prefer to
call it bidirectional division) enables one to perform a "division" by both
ends. The idea is the following. Assume you want to reduce a (2n)-limb product
C modulo a divisor N of n limbs, and assume n is even for simplicity.

You reduce the n/2 most significant limbs using classical division, and you
reduce the n/2 least significant limbs using Hensel division (aka Montgomery
REDC):

            ___________________________________
           |___H____|________|________|___L____|   input C (msb left)

            __________________________
           |________|________|________|            - Qh * N * B^(n/2)

                     __________________________
                    |________|________|________|   + Ql * N

           --------------------------------------

                     _________________
                    |________|________|            Result

(Here, B denotes the limb base, i.e., 2^32 or 2^64.)

As you can see, the upper quotient Qh only depends on the upper n/2 limbs H,
and the lower quotient only depends on the lower n/2 limbs L, thus there is
no dependency between the computation of Qh and Ql. (There might be a carry out
in the result, but let's ignore it.)

This operation can be used in modular multiplication as follows: convert
a residue X to X' := X*B^(n/2), and replace multiplications by X'*Y'/B^(n/2)
= (X*Y)*B^(n/2) = (X*Y)'. This is the same as for Montgomery's REDC, except
that B^n is replaced by B^(n/2).

My question is the following: could that bipartite modular multiplication be
made faster than Montgomery's REDC (where we only reduce by the low end) in
GMP? Indeed, in REDC or in classical division, you cannot reduce a limb before
having reduced the previous one, thus there is a strong dependency in the
reduction loop. Here one could reduce limbs alternatively between the most and
least significant parts, while waiting for the new limbs to be ready.

Moreover using 2 threads, one could reduce by both ends simultaneously. This
might give the fastest modular multiplication in the world...

Since I know nothing about assembly code and/or dependencies, comments are most
welcome.

Paul Zimmermann

@Article{KaTa08,
  author = 	 {Marcelo E. Kaihara and Naofumi Takagi},
  title = 	 {Bipartite Modular Multiplication Method},
  journal = 	 ieeetc,
  year = 	 2008,
  volume = 	 57,
  number = 	 2,
  pages = 	 {157--164}
}


More information about the gmp-devel mailing list