Wraparound multiplication vs mullo

Niels Möller  nisse at lysator.liu.se
Wed Sep 30 11:30:16 CEST 2009

There are several operations that can use wraparound multiplication,
i.e., multiplication mod 2^n + 1 or 2^n - 1 for some n, also for sizes
below the FFT range where this is a "natural" operation. These
operations include, I think:

* Basecase for the large coefficient fft.

* Various Newton iterations.

* Computation of a*b - c*d, when it is known a priori that there's a
  lot of cancellation at the most significant end (e.g., say a, b, c, d
  are all n bits, and it is known that 0 <= a*b - c*d < 2^{n+k} for
  some k << n).

Some of these can also use mullo or mulmid, instead of wrap-around.

Can we use Karatsuba/Toom-like tricks for wraparound multiplication?
For simplicity, I'll start with polynomials in this discussion, i.e.,
polynomial multiplication mod x^k + 1 or x^k - 1.

In general, if we have two polynomials a and b of degree less than k,
evaluate them in k distinct points w_i, and reconstruct the polynomial
c of degree less than k with the same values in these points, we get

  c = a * b mod p(x) where p(x) = (x-w_1) ... (x-w_k)

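This identity is easy to check numerically. Below is a small Python
sketch (not GMP code; exact rational arithmetic via the standard
fractions module) that evaluates two polynomials at some distinct
points, multiplies pointwise, interpolates, and compares against the
full product reduced mod p, where p is the product of (x - w_i) over
the chosen points:

```python
# Sketch: evaluation at k points + pointwise multiplication +
# interpolation gives a*b mod p(x), p(x) the product of (x - w_i).
from fractions import Fraction

def poly_mul(a, b):
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def poly_mod(c, p):
    # Reduce c modulo the monic polynomial p (coefficients low-to-high).
    c = c[:]
    while len(c) >= len(p):
        q = c[-1]
        for i in range(len(p)):
            c[len(c) - len(p) + i] -= q * p[i]
        c.pop()
    return c

def poly_eval(a, x):
    r = Fraction(0)
    for coeff in reversed(a):
        r = r * x + coeff
    return r

def interpolate(points, values):
    # Lagrange interpolation; returns degree < len(points) coefficients.
    k = len(points)
    result = [Fraction(0)] * k
    for i in range(k):
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j in range(k):
            if j != i:
                basis = poly_mul(basis, [-points[j], Fraction(1)])
                denom *= points[i] - points[j]
        for d in range(k):
            result[d] += values[i] * basis[d] / denom
    return result

w = [Fraction(v) for v in (1, -1, 2)]       # arbitrary distinct points
a = [Fraction(v) for v in (3, 1, 4)]        # a(x) = 3 + x + 4 x^2
b = [Fraction(v) for v in (2, 7, 1)]        # b(x) = 2 + 7 x + x^2
p = [Fraction(1)]
for wi in w:
    p = poly_mul(p, [-wi, Fraction(1)])     # p(x) = prod (x - w_i)

c = interpolate(w, [poly_eval(a, wi) * poly_eval(b, wi) for wi in w])
assert c == poly_mod(poly_mul(a, b), p)
```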
Let's consider some possibilities:

* k = 2, p(x) = x^2 - 1, w_1 = 1, w_2 = -1

Then we would compute a(±1) = a_0 ± a_1, b(±1) = b_0 ± b_1, multiply
pointwise and reconstruct. I.e., 2 multiplications in the coefficient
ring, compared to 3 for computing the full polynomial product using
Karatsuba.

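A quick sanity check of this two-multiplication scheme (Python sketch,
not GMP code; the interpolation c_0 = (s+d)/2, c_1 = (s-d)/2 is always
exact since s + d and s - d are even):

```python
# Sketch: polynomial multiplication mod x^2 - 1 with 2 coefficient
# multiplications: evaluate at the roots 1 and -1, multiply pointwise,
# and interpolate back.
def mul_mod_x2_minus_1(a0, a1, b0, b1):
    s = (a0 + a1) * (b0 + b1)   # c(1)  = a(1) b(1)
    d = (a0 - a1) * (b0 - b1)   # c(-1) = a(-1) b(-1)
    # interpolation: c0 + c1 = s, c0 - c1 = d
    return (s + d) // 2, (s - d) // 2

# Compare against direct reduction of the full product:
# (a0 + a1 x)(b0 + b1 x) mod (x^2 - 1) = (a0 b0 + a1 b1) + (a0 b1 + a1 b0) x
a0, a1, b0, b1 = 12, 34, 56, 78
c0, c1 = mul_mod_x2_minus_1(a0, a1, b0, b1)
assert (c0, c1) == (a0*b0 + a1*b1, a0*b1 + a1*b0)
```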
* k = 2, p(x) = x^2 + 1, w = ±i (imaginary unit)

We'd compute a(i) = a_0 + a_1 i, b(i) = b_0 + b_1 i. We don't need to
evaluate a(-i), since a(-i) = conjugate(a(i)) (real coefficients). So
we get only 1 complex multiplication, which can be done as 3 real
multiplications in the coefficient ring. That is no saving compared to
a full Karatsuba product.
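The three-real-multiplication complex multiply, spelled out as a
Python sketch (not GMP code):

```python
# Sketch: product mod x^2 + 1 as one complex multiplication, done with
# 3 real multiplications (Karatsuba-style complex multiply).
def mul_mod_x2_plus_1(a0, a1, b0, b1):
    m0 = a0 * b0
    m1 = a1 * b1
    m2 = (a0 + a1) * (b0 + b1)
    # a(i) b(i) = (a0 + a1 i)(b0 + b1 i) = (m0 - m1) + (m2 - m0 - m1) i,
    # and c(i) = c0 + c1 i determines c0, c1 directly.
    return m0 - m1, m2 - m0 - m1

a0, a1, b0, b1 = 12, 34, 56, 78
c0, c1 = mul_mod_x2_plus_1(a0, a1, b0, b1)
# (a0 + a1 x)(b0 + b1 x) mod (x^2 + 1) = (a0 b0 - a1 b1) + (a0 b1 + a1 b0) x
assert (c0, c1) == (a0*b0 - a1*b1, a0*b1 + a1*b0)
```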

* k = 3, p(x) = x^3 - 1, w_1 = 1, w_2 = w, w_3 = w^2 = conjugate(w),

with w = exp(i 2 pi/3). Again, there's no need to evaluate in w_3
since a(w_3) = a(conjugate(w)) = conjugate(a(w)). The naïve way to
multiply a(w) * b(w) involves 9 real multiplications in the base ring,
which is way too expensive. Maybe there's some trick to reduce the
number slightly (i.e., a trick for multiplication in Z[x] / (x^3 - 1),
similar to the "three real multiplies" trick for Z[x] / (x^2 + 1)), but
I don't think this is promising.

* k = 4, p(x) = x^4 - 1, w_j = { 1, -1, i, -i }

We evaluate in 1, -1 and i (-i is a free conjugate). Hence, we need
2 real and 1 complex multiplications, which can be done as 5 real
multiplications in the coefficient ring. This can be compared to 7
coefficient multiplications for full multiplication using Toom-4.
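Here's a Python sketch (not GMP code) of the five-multiplication
scheme, with the complex value handled by the three-real-multiplies
trick and the size-4 inverse transform written out explicitly (the
divisions by 4 are exact for integer inputs):

```python
# Sketch: product mod x^4 - 1 using 5 real multiplications: the two
# real values c(1), c(-1), plus one complex value c(i) costing 3 real
# mults; c(-i) is recovered for free as the conjugate of c(i).
def mul_mod_x4_minus_1(a, b):
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    v1 = (a0 + a1 + a2 + a3) * (b0 + b1 + b2 + b3)   # c(1)
    vm = (a0 - a1 + a2 - a3) * (b0 - b1 + b2 - b3)   # c(-1)
    # a(i) = (a0 - a2) + (a1 - a3) i, likewise for b; 3 real mults:
    re_a, im_a = a0 - a2, a1 - a3
    re_b, im_b = b0 - b2, b1 - b3
    m0 = re_a * re_b
    m1 = im_a * im_b
    m2 = (re_a + im_a) * (re_b + im_b)
    p, q = m0 - m1, m2 - m0 - m1                     # c(i) = p + q i
    # inverse transform of size 4, using c(-i) = p - q i:
    return ((v1 + vm + 2*p) // 4, (v1 - vm + 2*q) // 4,
            (v1 + vm - 2*p) // 4, (v1 - vm - 2*q) // 4)

a = (3, 1, 4, 1)
b = (2, 7, 1, 8)
c = mul_mod_x4_minus_1(a, b)
# Reference: full product reduced with x^4 = 1.
ref = [0, 0, 0, 0]
for i in range(4):
    for j in range(4):
        ref[(i + j) % 4] += a[i] * b[j]
assert list(c) == ref
```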

It seems relevant to compare this case also to FFT, which gives 4
multiplications in a different coefficient ring.

Now move on to integers. Say inputs are n bits; splitting in k pieces
gives coefficients of n/k bits. The x^2 - 1 algorithm leads to
computation mod 2^n - 1 in time 2 M(n/2) + O(n), while the x^4 - 1
algorithm gives 5 M(n / 4) + O(n).
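One way to realize the 2 M(n/2) bound, sketched in Python (not GMP
code; assumes n even): since 2^n - 1 = (2^m - 1)(2^m + 1) with
m = n/2, the two half-size products are the evaluations at "x = ±1" of
the split operands, and the results are recombined by CRT. The
inverse needed in the CRT step is trivial because 2^m + 1 ≡ 2
(mod 2^m - 1), and 2 has inverse 2^(m-1) there.

```python
# Sketch: a*b mod 2^n - 1 with two multiplications of n/2-bit numbers.
def mul_mod_2n_minus_1(a, b, n):
    assert n % 2 == 0
    m = n // 2
    a0, a1 = a & ((1 << m) - 1), a >> m      # a = a1*2^m + a0
    b0, b1 = b & ((1 << m) - 1), b >> m
    # a ≡ a0 + a1 (mod 2^m - 1) and a ≡ a0 - a1 (mod 2^m + 1):
    r1 = (a0 + a1) * (b0 + b1) % ((1 << m) - 1)   # a*b mod 2^m - 1
    r2 = (a0 - a1) * (b0 - b1) % ((1 << m) + 1)   # a*b mod 2^m + 1
    # CRT: 2^m + 1 ≡ 2 (mod 2^m - 1), with inverse 2^(m-1).
    t = (r1 - r2) * (1 << (m - 1)) % ((1 << m) - 1)
    return (r2 + ((1 << m) + 1) * t) % ((1 << n) - 1)

n = 16
a, b = 0xBEEF, 0xCAFE
assert mul_mod_2n_minus_1(a, b, n) == a * b % ((1 << n) - 1)
```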

Let's see what this means in the Schoolbook, Karatsuba and Toom
ranges, ignoring the O(n) term:

        Schoolbook  Karatsuba     Toom-3     Toom-4
exponent         2       1.59       1.46       1.40           
x^2 - 1  0.5  M(n)  0.66 M(n)  0.72 M(n)  0.76 M(n)
x^4 - 1  0.31 M(n)  0.56 M(n)  0.65 M(n)  0.71 M(n)
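For reference, a short Python sketch reproducing these entries: with
M(n) ~ n^e, the x^2 - 1 algorithm costs 2 M(n/2) = 2 (1/2)^e M(n) and
the x^4 - 1 algorithm 5 M(n/4) = 5 (1/4)^e M(n). (The computed values
agree with the table above up to rounding in the last digit.)

```python
# Sketch: relative cost of the wraparound algorithms, assuming M(n) ~ n^e.
from math import log

def wraparound_costs(e):
    return 2 * 0.5**e, 5 * 0.25**e

for name, e in [('Schoolbook', 2.0), ('Karatsuba', log(3, 2)),
                ('Toom-3', log(5, 3)), ('Toom-4', log(7, 4))]:
    c2, c4 = wraparound_costs(e)
    print('%-10s  x^2-1: %.2f M(n)   x^4-1: %.2f M(n)' % (name, c2, c4))
```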

I'd like to compare the efficiency also to the mullo operation
(multiplication mod 2^n). For large n (FFT range), computation mod 2^n
- 1 is faster than computation mod 2^n. But which operation is fastest
for smaller n?

What are the corresponding numbers for Mulders' mullo algorithm? I
have looked quickly through G. Hanrot and Paul's paper "A long note
on Mulders' short product", and if I understand the conclusions
correctly the time is roughly 0.7 M(n) in the Karatsuba range. That
seems to indicate that multiplication mod 2^n - 1 might be faster
also for smaller sizes.

