mpn_mulmod_bnm1

Thu Apr 3 00:06:55 UTC 2014

On 03/04/2014, at 7:33 AM, Niels Möller <nisse at lysator.liu.se> wrote:

> Torbjorn Granlund <tg at gmplib.org> writes:
> 
>> nisse at lysator.liu.se (Niels Möller) writes:
>> 
>>  I don't remember any fundamental difficulties.
>> 
>> Except the 2^m-1 one.
> 
> If I try to remember, I think it is like this. The fft is basically
> polynomial multiplication mod x^{2^n} - 1 (and the variant with mod
> x^{2^n}+1 instead should be easy). 
> 
> Then to get to and from integers, we evaluate for x = 2^k, where k is
> same order of magnitude as the word size. This gives us integer
> multiplication mod 2^{k 2^n} - 1, and for most useful parameters we
> have 2^{k 2^n} = B^m where m is an integer and, divisible by a pretty
> large power of two. E.g, with B = 2^{64}, m = k 2^{n-6}.
> 
> At this level, the small prime thing is an implementation detail. We
> choose a few small primes p_i (as well as k, the size of the pieces),
> such that the prime product exceeds the largest possibly coefficient in
> the product polynomial, which is bounded by 2^{2k} times the degree of
> the smallest factor, and work with images of the mappings
> Z[x]/(x^{2n}-1) --> Z_{p_i}[x](x^{2n}-1), and these images are
> multiplied efficiently using FFT over the fields Z_{p_i}.

Hi,

This might be an appropriate juncture for me to mention an idea I had a year or two ago, which I have been using successfully in my small-prime FFT library, and which or may not be useful in the context of GMP.

The basic question is how to do products modulo 2^m - 1 (or 2^m + 1) using a small-prime FFT.

Assume as above that m is divisible by a sufficiently large power of 2, so you can split into chunks of say b bits, and thus convert to polynomial multiplication in Z[x]/(x^n - 1), where n is a power of 2, and where the coefficients have b bits. Then you choose a bunch of small primes p_1, ..., p_r whose product is bigger than 2b + lg(n) bits. Then you use a small-prime FFT modulo each p_i.

The main problem with this approach is that there is not much granularity in the choice of r. You don't want r to get too large, because then the reduction/CRT steps can become quite expensive. In fact, you would actually prefer r to be quite small, and r should grow very slowly as a function of m. (I find typically r = 4 is pretty good.) The problem in the above sketch is that there is a big jump in running time as soon as b crosses a certain threshold where you need to add another prime.

One way to improve the "smoothness" is to allow non-power-of-two transform lengths, e.g. add in some radix-3 or radix-5 layers at the bottom. That way you get more flexibility with n, and thus you can tailor b to fit exactly into the desired number of primes. However, this introduces new problems: (a) more constraints on p_i, (b) a lot more work to code up the radix-3 and radix-5 transforms (especially if you want to do everything in assembly), and (c) the radix-3 and radix-5 transforms can never compete with radix-2 anyway. I tried this approach, but I still got big "bumps".

Another solution, which doesn't work at all, is to use TFTs. One cannot use TFTs directly for multiplication mod x^n - 1, because they evaluate at the wrong roots of unity.

So after playing around with all of these ideas for a while, I came up with a different solution in my own code, that I have been quite happy with.

In most algorithms calling for multiplication modulo 2^m - 1 (or 2^m + 1), such as fast versions of certain newton iterations, it's actually not that important that the modulus has exactly this form. Sometimes you really do need the wraparound, but often all you need is SOME modulus M of size 2^m such that multiplication modulo M has cost M(m/2) rather than M(m).

I propose using a modulus of the form

M = (B^{k u_1} + 1) (B^{k u_2 + 1) ... (B^{k u_s} + 1),

where

B = 2^64
k is some integer
u_1 > u_2 > ... > u_s is a decreasing sequence of powers of 2.

Then M is roughly of size B^{ku}, where u = u_1 + ... + u_s.

Example: if u = 13, then M = (B^{8k} + 1)(B^{4k} + 1)(B^k + 1).

I call such a modulus a "Fermat modulus", which is probably a terrible name, since I'm sure it's used for other things in the literature.

The scheme works as follows:

Suppose you want a modulus of size close to B^m.

Choose r = number of primes, any way you wish.

Choose u_1 (a power of two), which will control the overall granularity. (Something like u_1 = 64 works pretty well.)

Now we know u is going to be in the range u_1 <= u <= 2 u_1.

Choose a corresponding k so that multiplication modulo B^{u_1 k} + 1 can be done "optimally" for a small-prime FFT with r primes. Note that you have enormous flexibility in choosing k. You can thus ensure that b (the coefficient size) maps perfectly into the desired number of primes with no "wasted space".

Once we have chosen k, we can choose u_2, ..., u_s to make M pretty much exactly as big as we like, e.g. if u_1 = 64, then you get to choose M with less than 2% error.

Now there are three subroutines:

REDUCE: takes as input an arbitrary integer, and computes its residues modulo B^{k u_1} + 1, ..., B^{k u_s} + 1. The point is that this can be done in *linear* time. If I recall correctly, the cost is bounded by roughly two mpn_add's of size M. I don't have the energy to write out the details here, but the idea is basically to repeatedly use identities like B^{2k} - 1 = (B^k - 1)(B^k + 1), and only compute the parts you need. It ends up looking very similar to the top layers of the TFT algorithm, working on the input in chunks of k limbs.

MULTIPLY: multiplies modulo B^{k u_1} + 1, ..., B^{k u_s} + 1 separately, using small-prime FFTs. For the largest one (B^{k u_1} + 1), the coefficients will fit optimally into r primes. For the remaining ones, the coefficient will be slightly smaller, so you would prefer to use r - epsilon primes, but this is a not a big deal, since they rapidly get very small.

(In my code, I have additional data structures allowing me to reuse transforms for each of these components.)

CRT: takes as input residues modulo B^{k u_1} + 1, ..., B^{k u_s} + 1, and returns the unique corresponding residue modulo M. This can also be done in linear time. It is only slightly slower than REDUCE. Again I am not going to write out the details here, but one can give an algorithm very similar to the inverse TFT, it looks like a bunch of cross-butterflies, operating on chunks of k limbs.

In my code I get very smooth performance using this algorithm. One quick example, where M(n) = time for n-limb multiplication, I(n) = time for n-limb reciprocal:

    n        M(n)/M(10^7)      I(n)/M(n).
  1.0e7         1.000             1.687
  1.1e7         1.095             1.707
  1.2e7         1.200             1.701
  1.3e7         1.300             1.695
  1.4e7         1.394             1.697
  1.5e7         1.587             1.537
  1.6e7         1.670             1.710
  1.7e7         1.779             1.703
  1.8e7         1.873             1.701
  1.9e7         1.987             1.685
  2.0e7         2.109             1.668

The multiplications use TFTs, not "Fermat moduli". The reciprocal uses a theoretically 5/3 M(n) multiplication algorithm, using the Fermat moduli, and you can see I get very very close to this theoretical result. Everything here is with 4 primes.

The jump in M(n) at 1.5e7 is probably due to the TFT transform size crossing a power of two boundary. The jump in the reciprocal at 1.6e7 is probably the same effect. This is happening basically because multiplication is n log n, not simply n. These jumps could probably be mitigated to some extent, I have not optimised those TFT inner loops very much.

Note also this is not toy code... at 10^7 limbs the multiplication is about 1.5x faster than GMP 5.1.2, and the reciprocal is about 1.7x faster.

david