Wraparound multiplicaton vs mullo

Wed Sep 30 16:45:44 CEST 2009

On Sep 30, 2009, at 9:59 AM, Niels Möller wrote:

> David Harvey <dmharvey at cims.nyu.edu> writes:
>
>>> * k = 1, p(x) = x^2 + 1, w = ±i (imaginary unit)
>>>
>>> We'd compute a(i) = a_0 + a_1 i, b(i) = b_0 + i b_1. We don't  
>>> need to
>>> evaluate in a(-i), since a(-i) = conjugate a(i) (real coefficients).
>>> So we get only 1 complex mutliplication, which can be done as 3 real
>>> multiplications in the coefficient ring. Which is no saving compared
>>> to a full Karatsuba product.
>>
>> I think because x^2 + 1 is irreducible over Q. If your coefficient
>> ring contained the roots of p(x), then you could do it with only two
>> multiplications.
>
> The way I think of it, coefficients are in the ring Z, I extend this
> with the imaginary unit, to get Z[i] which is the same as Z[x] / (x^2
> + 1). (If one extends to the quotient ring, it also happens to be a
> field, since x^2 + 1 is irreducible, but I don't think field
> properties are important, all we need is that particular elements that
> turn up in the interpolation stage are invertible).
>
> Do you have some other extension or embedding of the coefficients in
> mind?

No, no other embedding in mind. It's just that you seemed to be  
talking about polynomial arithmetic in general, and I was just  
pointing out the fact (that you already knew) that the number of  
coefficient operations would depend on the coefficient ring. Over Z  
you are correct, 3 mults are required.

> Or are you thinking of some trick to reduce the multiplication in Z[x]
> / (x^2 + x + 1) to 3, rather than 4, real multiplications? Maybe it
> follows directly in some way I don't understand from the corresponding
> trick for Z[x] / (x^2 + 1) ?

you are making this more complicated than necessary :-)

They are linear polynomials. Multiply them using three mults using  
karatsuba. Then reduce modulo x^2 + x + 1; this needs only additions  
and subtractions.

Something else to think about. Suppose you use the modulus (x-1)(x-2) 
(x-3)(x-4). Multiplication modulo this costs 4 M(n/4) + O(n). The O 
(n) term is uncomfortably large. But maybe one could build a nice  
REDC on top of this anyway. Of course 1, 2, 3, 4 might not be the  
best evaluation points to use.

>> That's a really good question. I'm pretty sure mod 2^n is faster in
>> the basecase region.
>
> For the smallest sizes, basecase mullo ought to win, but it's not
> obvious to me how the thresholds for the x^4 - 1 algorithm relates  
> to the
> Karatsuba and Toom thresholds.

Not obvious to me either.

One huge complication is that multiplication modulo 2^n introduces a  
nasty branch misprediction, since the accumulation regions are  
triangles everywhere, leading to a very large extra O(n) term. Maybe  
Torbjorn found a clever way to deal with this in the squaring code?

> Schoolbook mullo takes 0.5 M(n) = 0.5 n^2. The x^2 - 1 has the same
> quadratic term but larger linear overhead, so it can't be faster.

Oops yes you are right (except possibly there's the branch  
misprediction issue).

> x^4 - 1 algorithm takes 0.31 n^2 + O(n) (if I got thtat calculation
> right).

Yes, it's 5/16 n^2.

> So it *might* be faster even below the Karatsuba threshold,
> depending on the linear koefficient in the running time.

Possibly.

> On x86_64, the Karatsuba threshold seems to be 28 limbs, so the x^4  
> - 1
> algorithm will use schoolbook for the recursive calls up to n = 112
> limbs. And the x^2 - 1 algorithm cannot be faster than schoolbook
> mullo until sizes are a bit above twice the Karatsuba threshold, i.e
> 56 words. Quite difficult to predict which algorithm will win in the
> range 50-200 limbs, where the coefficient multiplies are using
> schoolbook or Karatsuba.

I guess there's only one way to find out... :-)

david