Toom-4.5 (aka Toom-5x4, Toom-6x3, Toom-7x2)

Thu Oct 15 11:30:46 CEST 2009

Torbjorn Granlund <tg at gmplib.org> writes:

> Where will this madness stop?  Would a toom_50_50 have a place?
> Clearly, we cannot add 100 toom function as C code.  If we need 100
> different toom functions, we need to describe the evaluation and
> interpolation steps as some sort of byte code.

I'm not sure how to compare toom vs FFT. The following is an attempt
to understand howw it works. Say we have inputs of size n and k
evaluation points (give or take a point or two, this is a rough
sketch).

Product is of size 2n, and coefficients are of size 2n/k.

Evaluation takes a polynomial of degree k/2 and evaluates it in k
points. The naive way would involve k k/2 operations on the
coefficients. Let's assume symmetry and other tricks cut it down to
half (this seems reasonable also for larger k), this gives k n / 2
limb operations for each polynomial. Pointwise multiplication is k
M(2n/k). Interpolation is k^2 operations if done by multiplying by the
inverse matrix (each operation on numbers of size 4n/k. I'm not sure
one can do a lot better for larger k, so lets say this is what it
takes. Then we have 4 k n limb operations for interpolation.

I'm assuming here that points can be chosen such that all constants
used in evaluation and interpolation fit in a limb.

So for a complete toom multiplication we have

  Linear work:           5 k n
  Multiplication:        k M(2n/k)

For FFT, we have to embed coefficients in a ring where elements are of
size 4n / k. A transform (forward or reverse) is 3k/2 log_2 k
coefficient operations (counting the add, subtract and scalar mul in a
butterfly operation equally), or 6 (log_2 k) n limb operations. Let a
multiplication in the coefficient ring take M'(4n/k). This gives

  Linear work:           18 (log_2 k) n
  Multiplication:        k M'(4n/k)

For the linear work, toom wins over fft when log_2 k <= 3 (if we
assume interpolation can be sped up factor of two, we'd get log_2 k <=
4 instead).

For the multiplication work, at best M'(4n/k) = M(2n/k) (this is when
FFT is used for the pointwise multiplications with perfect wraparound
arithmetic), so the only difference is in the linear work. But I guess
that's not so reasonable, since that is assuming thtat toom recurses
to FFT. What it shows, I think, is that FFT with a small number of
points can't beat toom.

The opposite case is schoolbook. Assume M(n) = a n^2 (a = 2
seem to be typical for x86, it's the ratio of mpn_addmul_1 to
mpn_add_n), then

Toom: 5 k n       +  4 a n^2 / k 
FFT:  18 log_2 k n + 16 a n^2 / k

Here, it's cleary useless to compare toom with k points to FFT with
the same number of points. If we let the number of points for FFT be
k' and allow k' != k, when can FFT beat toom? If we choose k' = 4k
(equalizing multiplication work), then log_2 k = 4 gives is a tie, and
FFT beats toom for log_2 k > 4.

So can we conclude anything? Either there's some reasonably small k
such that toom-N for N > k is not the best choice for any size n, or
there's a range where each toom-N is the best choice (the limit being
instead that evaluation and interpolation constants are single-limb
values).

I'm leaning towards saying that toom with more than 16 points will not
be useful (i.e., toom-8 is the highest toom that may be interesting).
I expect that for N > 8, FFT with 8N points (or 8N rounded up to a
power of two) will always beat toom-N, for any relevant input size n
(and for *large* n, FFT with a larger numbers of points will beat it
even more).

Regards,
/Niels