bipartite modular multiplication

Wed Apr 29 10:23:53 CEST 2009

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

> the bipartite modular multiplication from [KaTa08] below (I would prefer to
> call it bidirectional division) enables one to perform a "division" by both
> ends. The idea is the following. Assume you want to reduce a (2n)-limb product
> C modulo a divisor N of n limbs, and assume n is even for simplicity.
>
> You reduce the n/2 most significant limbs using classical division, and you
> reduce the n/2 least significant limbs using Hensel division (aka Montgomery
> REDC):
>
>             ___________________________________
>            |___H____|________|________|___L____|   input C (msb left)
>
>             __________________________
>            |________|________|________|            - Qh * N * B^(n/2)
>
>                      __________________________
>                     |________|________|________|   + Ql * N
>
>            --------------------------------------
>
>                      _________________
>                     |________|________|            Result
>
> (Here, B denotes the limb base, i.e., 2^32 or 2^64.)

Clever trick... Similar idea as the bidirectional exact division, but
here you get a corresponding "remainder" of a peculiar kind.

> My question is the following: could that bipartite modular multiplication be
> made faster than Montgomery's REDC (where we only reduce by the low end) in
> GMP?

You replace a single 2n by n division by two n by n/2 divisions. I
imagine that should give some speedup in most size ranges? In the
schoolbok-range (not sure what typical thresholds are) it makes some
sense to me mix the two loops in a single assembly function. When
getting into the divide-and-conquer range, that seems harder (and
I think threading at this level is not going to get into gmp).

For the important application of modular exponentiation, one ought to
precompute an classical n/2 inverse of the high half of N and a n/2
modular inverse of the low part.

The other REDC application I'm aware of is small-prime fft; I wonder
if this trick could make it useful to do small-prime fft with primes
that are two limbs? Given that the low limb of a suitable prime should
be 1, the redc-part should be trivial. If I get this right, for a
segment of the FFT that processes four limbs, it's four 1x1->2
products and four 2->1 redc vs one 2x2->4 multiplications (also four
1x1->2) and a single 4->2 reduction as above.

Regards,
/Niels