New function mpn_mulmod_bnm1

Thu Oct 29 17:59:30 CET 2009

I wrapped and integrated Niels' code for multiplication mod B^n-1.  The
code is now in the GMP repository, <http://gmplib.org:8000/gmp/>.

To recapitulate, the code is based on the observation that UV mod B^n-1
when n is even can be computed as UV mod B^{n/2}-1 and UV mod B^{n/2}+1
followed by a fast CRT.  The mod B^{n/2}-1 part is computed recursively
as long as the exponent is even.  The mod B^{n/2}+1 part is computed
differently depending on operand size, as explained below.

Since this new function is meant not for computation of products modulo
B^n-1 for certain fixed n, but rather as a multiplication primitive for
when we want the full product but the "wrapping" can be undone, we can
typically choose an n' >= n with better properties and compute mod
B^n'-1 instead.  "More even" n' is one such property, since it allows
continued recursive computation of the B^{n/2}-1 part before n/2 is not
an integer.

The present FFT code in GMP (Schönhage-Strassen FFT multiplication)
computes mod B^m+1 for certain values of m.  This is usually a
disadvantage, but for our product it is exactly what we need, since it
handles the mod B^{n/2}+1 part directly.  For small operands, where FFT
is not fast, we use full multiplication followed by explicit reduction
mod B^{n/2}+1.  But since this is the home turf of the FFT code, the
crossover point to FFT is as low as a few hundred limbs.

The new function is called mpn_mulmod_bnm1.  A related function
mpn_mulmod_bnm1_next_size can be used for suggesting n'.

The interface is mpn_mulmod_bnm1 (rp, ap, bp, n', tp) where ap and bp
are pointers to the input operands, and rp should point to the result.
The common size is in n'.  Scratch space is passed in tp, its size
should be mpn_mulmod_bnm1_itch(n').

Note that this is an internal interface, and it will as such change.  We
actually plan to change it to take at least one more size argument,
accepting both n and n', making padding of input arguments unnecessary.

What are the performance results like?  It starts to help already from
between 10 and 20 limbs.  It then saves about 50% to 60% of the
computation time.

Next we should make the many places that now uses FFT wraparound to use
this function instead.  This means Newton iterations everywhere, the gcd
code, and large-operands division code.

This is an exiting new development which I think can be said to be a
result of the recent discussions here at the gmp-devel list, where Niels
Möller, David Harvey, Marco Bodrato and Paul Zimmermann contributed
valuable insights.  I actually cannot say who cracked the key ideas, so
please forgive me for not giving specific credit!

Let's keep using gmp-devel as a forum for research about arithmetic
algorithms!

-- 
Torbjörn