mpn_mulmod_bnm1

Wed Apr 2 20:33:23 UTC 2014

Torbjorn Granlund <tg at gmplib.org> writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   I don't remember any fundamental difficulties.
>   
> Except the 2^m-1 one.

If I try to remember, I think it is like this. The fft is basically
polynomial multiplication mod x^{2^n} - 1 (and the variant with mod
x^{2^n}+1 instead should be easy). 

Then to get to and from integers, we evaluate for x = 2^k, where k is
same order of magnitude as the word size. This gives us integer
multiplication mod 2^{k 2^n} - 1, and for most useful parameters we
have 2^{k 2^n} = B^m where m is an integer and, divisible by a pretty
large power of two. E.g, with B = 2^{64}, m = k 2^{n-6}.

At this level, the small prime thing is an implementation detail. We
choose a few small primes p_i (as well as k, the size of the pieces),
such that the prime product exceeds the largest possibly coefficient in
the product polynomial, which is bounded by 2^{2k} times the degree of
the smallest factor, and work with images of the mappings
Z[x]/(x^{2n}-1) --> Z_{p_i}[x](x^{2n}-1), and these images are
multiplied efficiently using FFT over the fields Z_{p_i}.

>   * Tweaking the roots to compute mod 2^m+1 rather than 2^m-1 (since 2^m-1
>     is better handled by splitting in mulmod_bnm1).
>     
> Perhaps having both variants would be useful, a CRT of 2^m+1 and 2^m-1
> is particularly easy, and would perhaps result in faster general
> multiply than mpn_mulmod_bnm1 approach of roughly 2^m+1, 2^{m/2}+1,
> 2^{m/4}+1, ...

My understanding is that if we want to compute mod 2^m-1 (even m), it's
cheaper to do the two smaller computations mod 2^{m/2} \pm 1, plus the
O(m) work for the CRT, than to do a single larger fft multiply. Sloppily
speaking, the mulmod_bnm1 split wins because M(n) > 2 M(n/2) + O(m).

If this is correct, a function for computing mod 2^m-1 using the fft
directly is clearly pointless.

(And then mulmod_bnm1 also uses the same split recursively, for the
2^{m/2} - 1 part, but with diminishing returns for each additional
split).

>   * Implementing David Harvey's trick: slightly smaller primes, but
>     butterflies sped up from 9-10 or so cycles down to 6 (this was
>     benchmarked on Opteron, where 6 is the limit from multiplier
>     throughput).
>   
> That indeed seems like something that ought to be done already in
> version 1 of the checkin...

Hopefully, the higher-level pieces of the implementation shouldn't be
affected if/when we switch from 64-bit primes to 62-bits.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.