Wraparound multiplicaton vs mullo

Wed Sep 30 15:59:32 CEST 2009

David Harvey <dmharvey at cims.nyu.edu> writes:

>> * k = 1, p(x) = x^2 + 1, w = ±i (imaginary unit)
>>
>> We'd compute a(i) = a_0 + a_1 i, b(i) = b_0 + i b_1. We don't need to
>> evaluate in a(-i), since a(-i) = conjugate a(i) (real coefficients).
>> So we get only 1 complex mutliplication, which can be done as 3 real
>> multiplications in the coefficient ring. Which is no saving compared
>> to a full Karatsuba product.
>
> I think because x^2 + 1 is irreducible over Q. If your coefficient
> ring contained the roots of p(x), then you could do it with only two
> multiplications.

The way I think of it, coefficients are in the ring Z, I extend this
with the imaginary unit, to get Z[i] which is the same as Z[x] / (x^2
+ 1). (If one extends to the quotient ring, it also happens to be a
field, since x^2 + 1 is irreducible, but I don't think field
properties are important, all we need is that particular elements that
turn up in the interpolation stage are invertible).

Do you have some other extension or embedding of the coefficients in
mind?

>> * k = 3, p(x) = x^3 - 1, w_0 = 1, w_1 = w, w_2 = w^2 = conjugate(w),

> You can use x^3 - 1 = (x - 1)(x^2 + x + 1). The latter is irreducible
> over Q. So you get M(n) = M(n/3) + M(2n/3) + O(n), which might still
> be a saving depending on n. (In your language, 4 real multiplications
> instead of 9.)

It's long time since I did this kind of algebra. The root w above
should still be the relevant primitive root. But what your saying is
that it has order 2, not 3, thanks to the factorization.

But then you lose me. We still have to do multiplication in the ring
Z[X] / (x^2 + x + 1). If we represent elements as polynomials of
degree one, we get multiplication as

  (a0 + a1 x) * (b0 + b1 x)
    = a0*b0 + (a0*b1 + a1*b0) x + a1*b1 x^2
    = a0*b0 + (a0*b1 + a1*b0) x - a1*b1 (x + 1)   (mod (x^2 + x + 1))

which needs 4 multiplications in the coefficient ring. And then we
also have the (x-1) factor/modulo/evaluation which also corresponds to
one multiply. So in total, we get 5 real multiplicatinos, the same as
for a full Toom-3 multiplication.

Or are you thinking of some trick to reduce the multiplication in Z[x]
/ (x^2 + x + 1) to 3, rather than 4, real multiplications? Maybe it
follows directly in some way I don't understand from the corresponding
trick for Z[x] / (x^2 + 1) ?

> That's a really good question. I'm pretty sure mod 2^n is faster in
> the basecase region.

For the smallest sizes, basecase mullo ought to win, but it's not
obvious to me how the thresholds for the x^4 - 1 algorithm relates to the
Karatsuba and Toom thresholds.

Schoolbook mullo takes 0.5 M(n) = 0.5 n^2. The x^2 - 1 has the same
quadratic term but larger linear overhead, so it can't be faster. x^4
- 1 algorithm takes 0.31 n^2 + O(n) (if I got thtat calculation
right). So it *might* be faster even below the Karatsuba threshold,
depending on the linear koefficient in the running time.

On x86_64, the Karatsuba threshold seems to be 28 limbs, so the x^4 - 1
algorithm will use schoolbook for the recursive calls up to n = 112
limbs. And the x^2 - 1 algorithm cannot be faster than schoolbook
mullo until sizes are a bit above twice the Karatsuba threshold, i.e
56 words. Quite difficult to predict which algorithm will win in the
range 50-200 limbs, where the coefficient multiplies are using
schoolbook or Karatsuba.

Regards,
/Niels