Iterative Toom-Cook-2.5 (Toom32) method for unbalanced multiplication
Alberto Zanoni
zanoni at volterra.uniroma2.it
Wed Dec 23 14:33:10 CET 2009
> Why did you choose toom32 as a primitive for very unbalanced operations?
> Which would be more efficient, toom32 or, say, toom42?
A bit more detailed explanation: the idea is to compute the coefficients of
the product of
a(x) = a_d x^d + a_(d-1) x^(d-1) + ... + a_1 x + a_0
b(x) = b_1 x + b_0
that is
c(x) = (a_d b_1) x^(d+1) + ... + (a_k b_0 + a_(k-1) b_1) x^k + ... + (a_0 b_0)
where each coefficient multiplication may be a balanced Toom, a basecase, or
an FFT multiplication. With the "basecase" way there would be 2(d+1)
multiplications.
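The "basecase" cost count can be illustrated with a small sketch (plain Python
here just for clarity; in GMP these would be mpn calls on limb arrays). Each
coefficient product a_k * b_i stands for one big-number multiplication,
2(d+1) in total:

```python
def mul_by_linear(a, b):
    """Schoolbook product of a(x) (coefficients low-to-high, a = [a0, ..., ad])
    by b(x) = b1*x + b0. Each a_k * b_i is one big-number multiplication,
    so 2*(d+1) multiplications in total."""
    b0, b1 = b
    c = [0] * (len(a) + 1)
    for k, ak in enumerate(a):
        c[k]     += ak * b0   # a_k * b0 contributes to x^k
        c[k + 1] += ak * b1   # a_k * b1 contributes to x^(k+1)
    return c

# Example: (3x^2 + 2x + 1) * (5x + 4) = 15x^3 + 22x^2 + 13x + 4
assert mul_by_linear([1, 2, 3], [4, 5]) == [4, 13, 22, 15]
```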
Dividing a(x) into "sections" of degree 2 each, as follows,
a(x) = a'(x) x^(3m) + (a_(3m-1) x^2 + a_(3m-2) x + a_(3m-3)) x^(3(m-1)) + ... +
(a_5 x^2 + a_4 x + a_3) x^3 + (a_2 x^2 + a_1 x + a_0)
where a'(x) is the head part (the corresponding section can have degree less
than 2), the idea is to iterate Toom-2.5 multiplication of the j-th section
with b:
a_j(x) = (a_(3j+2) x^2 + a_(3j+1) x + a_(3j)) x^(3j)
b(x) = b_1 x + b_0
recomposing the (partial) result with the part computed so far. The number of
multiplications (4 for each section) lowers to 4*floor((d+1)/3). With
toom42 one would have 5*floor((d+1)/4), and similarly for higher Toom methods.
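As an illustration (a hypothetical Python sketch, not the actual GMP code):
each 3-coefficient section is multiplied by b with a 4-multiplication
Toom-2.5 step (evaluations at 0, 1, -1, infinity), and the partial products
are recomposed by addition. The head part is handled here by zero-padding for
simplicity.

```python
def toom25_mul_by_linear(sec, b):
    """One Toom-2.5 step: sec = [s0, s1, s2] (a section, degree <= 2),
    b = [b0, b1]. Four multiplications instead of six schoolbook ones."""
    s0, s1, s2 = sec
    b0, b1 = b
    w0   = s0 * b0                      # evaluation at 0
    w1   = (s0 + s1 + s2) * (b0 + b1)   # evaluation at 1
    wm1  = (s0 - s1 + s2) * (b0 - b1)   # evaluation at -1
    winf = s2 * b1                      # evaluation at infinity
    # Interpolation: c(x) = c0 + c1*x + c2*x^2 + c3*x^3
    c0 = w0
    c3 = winf
    c2 = (w1 + wm1) // 2 - w0           # w1 + wm1 is always even
    c1 = (w1 - wm1) // 2 - winf         # w1 - wm1 is always even
    return [c0, c1, c2, c3]

def unbalanced_mul(a, b):
    """Multiply a(x) (coefficients low-to-high) by linear b(x), iterating
    Toom-2.5 over sections of 3 coefficients. The head part is handled by
    zero-padding, so the result may carry trailing zero coefficients."""
    while len(a) % 3:
        a = a + [0]
    c = [0] * (len(a) + 1)
    for j in range(0, len(a), 3):
        part = toom25_mul_by_linear(a[j:j + 3], b)
        for i, p in enumerate(part):    # recompose partial results by addition
            c[j + i] += p
    return c
```

In the real code the recomposition overlaps: section j's c3 lands on the same
position as section j+1's c0, which is why the partial results are added
rather than written.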
> One cost related to very unbalanced multiplication is the assembly of
> the toom-generated pieces. This could be avoided by making the
> interpolation code understand the situation; it could add instead of
> write the low n product limbs. Do you do that?
I do simple additions to assemble the pieces.
> The code contains two versions of the functions. The first one (1)
> considers only parity of b evaluation, the second (2) of b and of a's
> sections. In my experiments, (1) seems to be a bit faster than (2) on
> my architecture, but I report both versions. Maybe testing on other
> architectures could give different results. Compile with
>
> I don't understand "of b and of a's sections".
See above. An "even" Toom-2.5 can be used if b+ = b0 + b1 (and therefore b_ =
b0 - b1) is even, or if a+ = a0 + a1 + a2 (and therefore a_ = a0 - a1 + a2) is
even. In this case the decoupling step can be done in the evaluation phase,
with two additions and one shift on operands of half the length compared with
the interpolation phase:
b+ = b0 + b1
b+ = b+ >> 1
b_ = b+ - b1
(or something similar for a). If this is done for b (50% of cases), it is done
once and for all, and therefore _all_ shifts in the interpolation phase for
all section products disappear, and this is a concrete improvement.
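As a toy illustration of the trick (plain Python integers standing in for
half-length mpn operands):

```python
# b = b1*x + b0 with b0 + b1 even: halve the evaluations up front.
b0, b1 = 11, 7                  # b0 + b1 = 18, even
bp = b0 + b1                    # b+ = b(1)
assert bp % 2 == 0              # the even case (about 50% of inputs)
bp >>= 1                        # one shift, done once for all sections
bm = bp - b1                    # b_ = (b0 - b1)/2 follows with no extra shift
assert bp == (b0 + b1) // 2
assert bm == (b0 - b1) // 2
```

Since w1 = a(1)*b+ and w_1 = a(-1)*b_, using the halved b+ and b_ halves w1
and w_1 directly, so the divisions by 2 in the interpolation phase are no
longer needed.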
If b+ is odd, one can check the parity of a+, but this depends on the single
section. Averaging over the probabilities of all events, the number of shifts
is (3m+2)/4, where m = (d+1)/3.
One can then consider code checking
(A) only b+ parity
(B) both b+ and each a-section parity
In (A), 50% of cases are covered; more in (B). Unfortunately, according to my
experiments, the difference between (A) and (B) is not very significant, but
maybe on other architectures the result is different. Conversely, the gain
with respect to GMP (in well-defined regions) is significant.
Alberto