Wraparound multiplicaton vs mullo
dmharvey at cims.nyu.edu
Wed Sep 30 16:45:44 CEST 2009
On Sep 30, 2009, at 9:59 AM, Niels Möller wrote:
> David Harvey <dmharvey at cims.nyu.edu> writes:
>>> * k = 1, p(x) = x^2 + 1, w = ±i (imaginary unit)
>>> We'd compute a(i) = a_0 + a_1 i, b(i) = b_0 + i b_1. We don't
>>> need to
>>> evaluate in a(-i), since a(-i) = conjugate a(i) (real coefficients).
>>> So we get only 1 complex mutliplication, which can be done as 3 real
>>> multiplications in the coefficient ring. Which is no saving compared
>>> to a full Karatsuba product.
>> I think because x^2 + 1 is irreducible over Q. If your coefficient
>> ring contained the roots of p(x), then you could do it with only two
> The way I think of it, coefficients are in the ring Z, I extend this
> with the imaginary unit, to get Z[i] which is the same as Z[x] / (x^2
> + 1). (If one extends to the quotient ring, it also happens to be a
> field, since x^2 + 1 is irreducible, but I don't think field
> properties are important, all we need is that particular elements that
> turn up in the interpolation stage are invertible).
> Do you have some other extension or embedding of the coefficients in
No, no other embedding in mind. It's just that you seemed to be
talking about polynomial arithmetic in general, and I was just
pointing out the fact (that you already knew) that the number of
coefficient operations would depend on the coefficient ring. Over Z
you are correct, 3 mults are required.
> Or are you thinking of some trick to reduce the multiplication in Z[x]
> / (x^2 + x + 1) to 3, rather than 4, real multiplications? Maybe it
> follows directly in some way I don't understand from the corresponding
> trick for Z[x] / (x^2 + 1) ?
you are making this more complicated than necessary :-)
They are linear polynomials. Multiply them using three mults using
karatsuba. Then reduce modulo x^2 + x + 1; this needs only additions
Something else to think about. Suppose you use the modulus (x-1)(x-2)
(x-3)(x-4). Multiplication modulo this costs 4 M(n/4) + O(n). The O
(n) term is uncomfortably large. But maybe one could build a nice
REDC on top of this anyway. Of course 1, 2, 3, 4 might not be the
best evaluation points to use.
>> That's a really good question. I'm pretty sure mod 2^n is faster in
>> the basecase region.
> For the smallest sizes, basecase mullo ought to win, but it's not
> obvious to me how the thresholds for the x^4 - 1 algorithm relates
> to the
> Karatsuba and Toom thresholds.
Not obvious to me either.
One huge complication is that multiplication modulo 2^n introduces a
nasty branch misprediction, since the accumulation regions are
triangles everywhere, leading to a very large extra O(n) term. Maybe
Torbjorn found a clever way to deal with this in the squaring code?
> Schoolbook mullo takes 0.5 M(n) = 0.5 n^2. The x^2 - 1 has the same
> quadratic term but larger linear overhead, so it can't be faster.
Oops yes you are right (except possibly there's the branch
> x^4 - 1 algorithm takes 0.31 n^2 + O(n) (if I got thtat calculation
Yes, it's 5/16 n^2.
> So it *might* be faster even below the Karatsuba threshold,
> depending on the linear koefficient in the running time.
> On x86_64, the Karatsuba threshold seems to be 28 limbs, so the x^4
> - 1
> algorithm will use schoolbook for the recursive calls up to n = 112
> limbs. And the x^2 - 1 algorithm cannot be faster than schoolbook
> mullo until sizes are a bit above twice the Karatsuba threshold, i.e
> 56 words. Quite difficult to predict which algorithm will win in the
> range 50-200 limbs, where the coefficient multiplies are using
> schoolbook or Karatsuba.
I guess there's only one way to find out... :-)
More information about the gmp-devel