bipartite modular multiplication
nisse at lysator.liu.se
Wed Apr 29 10:23:53 CEST 2009
Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:
> the bipartite modular multiplication from [KaTa08] below (I would prefer to
> call it bidirectional division) enables one to perform a "division" by both
> ends. The idea is the following. Assume you want to reduce a (2n)-limb product
> C modulo a divisor N of n limbs, and assume n is even for simplicity.
> You reduce the n/2 most significant limbs using classical division, and you
> reduce the n/2 least significant limbs using Hensel division (aka Montgomery
> |___H____|________|________|___L____| input C (msb left)
> |________|________|________| - Qh * N * B^(n/2)
> |________|________|________| + Ql * N
> |________|________| Result
> (Here, B denotes the limb base, i.e., 2^32 or 2^64.)
Clever trick... Similar idea as the bidirectional exact division, but
here you get a corresponding "remainder" of a peculiar kind.
> My question is the following: could that bipartite modular multiplication be
> made faster than Montgomery's REDC (where we only reduce by the low end) in
You replace a single 2n by n division by two n by n/2 divisions. I
imagine that should give some speedup in most size ranges? In the
schoolbok-range (not sure what typical thresholds are) it makes some
sense to me mix the two loops in a single assembly function. When
getting into the divide-and-conquer range, that seems harder (and
I think threading at this level is not going to get into gmp).
For the important application of modular exponentiation, one ought to
precompute an classical n/2 inverse of the high half of N and a n/2
modular inverse of the low part.
The other REDC application I'm aware of is small-prime fft; I wonder
if this trick could make it useful to do small-prime fft with primes
that are two limbs? Given that the low limb of a suitable prime should
be 1, the redc-part should be trivial. If I get this right, for a
segment of the FFT that processes four limbs, it's four 1x1->2
products and four 2->1 redc vs one 2x2->4 multiplications (also four
1x1->2) and a single 4->2 reduction as above.
More information about the gmp-devel