Iterative Toom-Cook-2.5 (Toom32) method for unbalanced multiplication

Wed Dec 23 14:33:10 CET 2009

> Why did you chose toom32 as a primitive for very unbalanced operations.
> Which would be more efficient, tom32 or, say, toom42?

A bit more detailed explanation: the idea is to compute the coefficients of 
the product of

a(x) = a_dx^d + a_(d-1)x^(d-1) + ... a_1x + a0
b(x) = b_1x + b0

that is

c(x) = (a_db_1)x^(d+1) + ... + (a_kb0 + a_(k-1)b1)x^k +...+ (a0b0)

where each coefficient multiplication may be a balanced Toom or a basecase, or 
a FFT. With the "basecase" way there would be 2(d+1) multiplications. 
Dividing a(x) in "sections" of degree 2 each, as follows,

a(x) =  a'(x) + (a_(3m-1)x^2 + a_(3m-2)x + a_(3m-3))x^(m-1)+ ... + (a_5x^2 + 
a_4x + a_3)x^3 + (a_2x^2 + a_1x + a0)

where a'(x) is the head part: the corresponding section can have degree less 
than 2. The idea is to iterate Toom-2.5 multiplication of the j-section with 
b:

a_j(x) = (a_(3j+2)x^2 + a_(3j+1)x + a_(3j))x^j
b(x) = b_1x + b0

recomposing the (partial) result with the part computed until now. The 
multiplications numbers (4 for each sections) lowers to 4floor((d+1)/3). With 
toom42 one would have 5floor((d+1)/4), and similarly for higher Toom methods.
> One cost related to very unbalanced multiplication is the assembly of
> the toom-generated pieces.  This could be avoided by making the
> interpolation code understand the situation; it could add instead of
> writ the low n product limbs.  Do you do that?

I do simple additions to assembly.

>   The code contains two version of the functions. The first one (1)
>   considers only parity of b evaluation, the second (2) of b and of a's
>   sections. In my experiments, (1) seems to be a bit faster than (2) on
>   my architecture, but I report both versions. Maybe testing on other
>   architectures could give different results. Compile with
>
> I don't understand "of b and of a's sections".

See above. Even Toom-2.5 can be used if b+ = b0 + b1 (and therefore b_ = b0 - 
b1) is even, or a+ = a0+a1+a2 (and therefore a_ = a0-a1+a2) is even. In this 
case the decoupling step can be done in the evaluation phase with two 
additions and a shift of operands with half the length than in the 
interpolation phase

b+ = b0 + b1
b+ = b+ >> 1
b_ = b+ - b1

(or something similar for a). If this is done for b (50 % of cases), it is 
done once and for all, and therefore _all_ shifts in interpolation phases for 
all sections products disappear, and this is a concrete improvement.
If b+ is odd, one can check a+ parity, but this depends on the single section.
Averaging consider the probabilities of all events, the number of shifts is
(3m+2)/4, where m = (d+1)/3.

One can then consider code checking
(A) only b+ parity
(B) both b+ and each a-section parity

In (A), 50% of cases is covered, more in (B). Unfortunately, according to my 
experiments, the difference of (A) and (B) is not quite meaningful, but maybe 
on other architecture the result is different. Viceversa, the gain with 
respect to GMP (in well-defined regions) is meaningful.

Alberto