# Wraparound multiplicaton vs mullo

David Harvey dmharvey at cims.nyu.edu
Wed Sep 30 15:01:18 CEST 2009

```On Sep 30, 2009, at 5:30 AM, Niels =?iso-8859-1?Q?M=F6ller?= wrote:

> *  k = 1, p(x) = x^2 - 1, w_0 = 1, w_1 = -1
>
> Then we would compute a(±1) = a_0 ± a_1, b(±1) = b_0 ± b_1, multiply
> pointwise and reconstruct. I.e., 2 multiplications in the
> coefficient ring,
> compared to 3 for computing the full polynomial product using
> Karatsuba.

Right, this is equivalent to reducing modulo x - 1 and x + 1,
multiplying each separately, and reconstruct via CRT.

> * k = 1, p(x) = x^2 + 1, w = ±i (imaginary unit)
>
> We'd compute a(i) = a_0 + a_1 i, b(i) = b_0 + i b_1. We don't need to
> evaluate in a(-i), since a(-i) = conjugate a(i) (real coefficients).
> So we get only 1 complex mutliplication, which can be done as 3 real
> multiplications in the coefficient ring. Which is no saving compared
> to a full Karatsuba product.

I think because x^2 + 1 is irreducible over Q. If your coefficient
ring contained the roots of p(x), then you could do it with only two
multiplications.

> * k = 3, p(x) = x^3 - 1, w_0 = 1, w_1 = w, w_2 = w^2 = conjugate(w),
>
> with w = exp(i 2 pi/3). Again, there's no need to evaluate in w_2
> since a(w_2) = a(conjucate(w)) = conjugate(a(w)). The naïve way to
> multiply a(w) * b(w) involves 9 real multiplications in the base ring,
> which is way too expensive. Maybe there's some trick to reduce the
> number slightly (i.e., a trick for multpilcation in Z[x] / (x^3 - 1),
> similar to the "three real multiplies"-trick for Z[x] / (x^2 + 1)),
> but
> I don't think this is promising.

You can use x^3 - 1 = (x - 1)(x^2 + x + 1). The latter is irreducible
over Q. So you get M(n) = M(n/3) + M(2n/3) + O(n), which might still
be a saving depending on n. (In your language, 4 real multiplications

FLINT uses this trick in the Schonhage--Strassen routine for the
pointwise multiplications, in the form B^(3k) + 1 = (B^k + 1)(B^2k -
B^k + 1), where e.g. B = 2^64. Of course, this only works when the
coefficient size is divisible by 3.

> * k = 4, p(x) = x^4 - 1, w_j = { 1, -1, i, -i }
>
> We evaluate in 1, -1 and i  (-i is a free conjugate). Hence, we need
> 2 real and 1 complex multiplications, which can be done as 5 real
> multiplicatiions in the coefficient ring. This can be compared to 7
> coefficient multiplicatinos for full multiplication using toom4.

Alternatively, this can be viewed as applying the k = 2 trick
recursively. The next step would be

x^8 - 1 = (x^4 + 1)(x^2 + 1)(x + 1)(x - 1)

I guess this is all about cyclotomic polynomials. I'm fairly sure
I've seen an old paper that considers all kinds of tricks like this,
but I can't for the life of me remember who wrote it or when. If I
remember I'll get back to you.

> I'd like to compare the efficiency also to the mullo operation
> (multiplication mod 2^n). For large n (FFT range), computation mod 2^n
> - 1 is faster than computation mod 2^n. But which operation is fastest
> for smaller n?

That's a really good question. I'm pretty sure mod 2^n is faster in
the basecase region. But your argument above suggests that even in
the karatsuba range,  mod 2^n - 1 could be faster. I bet there is
some crossover point fairly early on, and that might affect all kinds
of other algorithms like division and square root.

david

```