Toom symmetries

Wed Jun 10 09:00:08 CEST 2009

Ciao!

>> The simplest example, 4 points,

Unfortunately the best examples of symmetry came from the
"Toom-and-a-half", Toom-2.5, Toom-3.5 and so on (as I called them with
Zanoni). This means unbalanced only methods. The balanced always have an
odd number of points...

>>   1  0  0  0
>>   1  1  1  1
>>   1 -1  1 -1
>>   0  0  0  1
>>
>> After addsub/2, we have
>>
>>   1  0  0  0
>>   0  1  0  1
>>   1  0  1  0
>>   0  0  0  1

An addsub/2 can be useful here, currently the effect of
(a,b) <- ( (a+b)/2, (a-b)/2 )
is given by the sequence:
a <- (a+b)/2
b <-  a-b
which implies one shift only.

In other sequences I wrote, you can find something like:
a <- a+b
b <- 2*b
b <- a-b
this should be replaced by addsub (avoiding the shift)

> Hmmm... sorry to interrupt your train of thought, this gives me an
> idea. Maybe you can delay some of the division by two until after
> part of the recomposition?
[...]
> I haven't checked the operation count, but it seems plausible that
> something like this could save half a shift.

There are a lot of tricks like this, the first one in Karatsuba (due to?)
saving half an addition. There is one in Toom-3, saving half a sum again,
and it is implemented in the current[1] interpolate_5pts, another in
Toom-3.5, saving two half additions, implemented in interpolate_6pts.
But in this particular case, I can not see it, because with the currently
implemented sequence we compute one shift only, not two...

Anyway the general idea is good, mix the recomposition phase and the
interpolation, to save some half-operations. I'll give a real example I
developed some times ago (but I did not have the time to really
implement/test it).

Again, the trick works perfectly for any Toom-and-a-half, the example is
Toom-4.5, i.e. interpolation_8pts. I start from the matrix Niels wrote:

>> For the next example, say we use eight points, 0, oo, 1, -1, 2, -2,
>> 1/2, -1/2. The full matrix is
>>
>>    1     0  0  0   0   0  0    0
>>    1     1  1  1   1   1  1    1
>>    1    -1  1 -1   1  -1  1   -1
>>    1     2  4  8  16  32 64  128
>>    1    -2  4 -8  16 -32 64 -128
>>    128  64 32  16  8   4  2    1
>>    128 -64 32 -16  8  -4  2   -1
>>    0     0  0   0  0   0  0    1

Niels proposed the butterfly operations, I suggest something slightly
different, removing also the values on the first and last columns, and
with slightly different shifts, the result I need is:

   1   0  0   0  0   0  0   0
   0   1  0   1  0   1  0   0
   0   0  1   0  1   0  1   0
   0   1  0   4  0  16  0   0
   0   0  1   0  4   0 16   0
   0  16  0   4  0   1  0   0
   0   0 16   0  4   0  1   0
   0   0  0   0  0   0  0   1

As you can see, we have a lot of [x,0;0,x] blocks, this does not only mean
we have two matrices that are identical and can be interpolated with one
single function... we can do something better!
We can "recompose" couples of lines! We have:

|==== r0 =====|
      ||==== r1 =====|
             ||==== r2 =====|
                    ||==== r3 =====|
                           ||==== r4 =====|
                                  ||==== r5 =====|
                                         ||==== r6 =====|
                                                 |==== r7 =====|
and we recompose some pairs to get
|==== r0 =====|
      ||==== r1 and r2 =====|
                    ||==== r3 and r4 =====|
                                  ||==== r5 and r6 =====|
                                                 |==== r7 =====|

then we can interpolate the internal matrix

 1  1  1
 1  4 16
16  4  1

only once, operating on slices that are 3/2 times longer than the original
ones, thus saving one half on _every_ operation after this step:
additions, shifts, divisions... everything.

Moreover, interlacing also the evaluation steps, we can also save on
memory footprint: here is the strategy.

- split in slices of "n" limbs
- evaluate in {0, oo} (2n limbs each)
- for each "a"
  - evaluate in {a, -a} (2n limbs + carry, each)
  - butterfly and remove first/last column
  - partially recompose (only 3n limbs + carry, n limbs [and a carry] freed)
- interpolate the inner matrix

For example, Toom-4.5 should need for the interpolation phase: 2 values of
2n limbs and 3 values of 3n limbs, a total of 13n limbs; instead of
8*2n=16n limbs...

It should not be hard to write a single generic function doing
butterfly/remove/recompose for any {a} (if a is a power of 2).

Unfortunately, I have to underline it again... this strategy only works
for the Toom-and-a-half algorithms, I mean unbalanced only :-(

Have a nice day :-D
Marco

[1]
http://gmplib.org:8000/gmp/file/79e81acfb9b3/mpn/generic/toom_interpolate_5pts.c#l134

-- 
http://bodrato.it/toom-cook/