Toom symmetries

Wed Jun 10 22:07:30 CEST 2009

bodrato at mail.dm.unipi.it writes:

> As you can see, we have a lot of [x,0;0,x] blocks, this does not only mean
> we have two matrices that are identical and can be interpolated with one
> single function... we can do something better!
> We can "recompose" couples of lines! We have:

That's clever. To see if I understand this, let's consider the simplest
four-point interpolation first. The number we're trying to construct
is x=f(B), with

  f(t) = x_0 + x_1 t + x_2 t^2 + x_3 t^3

and B is the power of two used for splitting. To start with, we are
given

  w0 = f(0) = x0
  w1 = f(1) = x0 + x1 + x2 + x3
  w2 = f(-1) = x0 - x1 + x2 - x3
  w3 = f(oo) = x3

First, a butterfly operation, replacing w1 and w2 by

  w1 = x1 + x3
  w2 = x0 + x2

Then x2 = w2 - w0 and x1 = w1 - w3, so we can write

  x = w0 + (w1 - w3) B + (w2 - w0) B^2 + w3 B^3
    = -(w3 B - w0) + (w1 + w2 B) B + (w3 B - w0) B^2

Let L w0 and H w0 be the low and high part of w0. Define

  y = w3 - H w0
  z = w1 + B w2

Then we have

  x = Lw0 - y B + z B + y B^2

or graphically
      B^4  B^3  B^2   B    1
  |    |    |    |    |    |
            +---------+----+
            |   - y   |Lw0 |
      +-----+---------+
      |      z        |
  +---+---------------+
  |   y     |-Lw0|
  +---------+----+

So the inputs to the recombination, at this stage, are y (2n limbs +
sign), z (2n+1 limbs) and Lw0 (n limbs), for a total of 5n + 1 limbs.
Compared to 8*n + 2 needed to represent the original w:s. We need
scratch space for the recursive computations, and for the storage of
z, but no more. (To compute z, put factors in the product area, first
product in z, second in the product area, then do the butterflies and
combination. w0 and w3 can be computed directly into the product area
and combined into y.

I don't fully understand the current toom32_mul, but maybe it's doing
precisely this? The temporary allocation makes me suspicious, though:

toom32_mul_itch returns 4*n + 2. toom32_mul does this,

  #define scratch_out	scratch + 4 * n + 2
  [...]
  TOOM22_MUL_N_REC (vm1, asm1, bsm1, n, scratch_out);

The recursive call is to toom22_mul, which needs ~ 2n limbs of scratch
space, according to toom22_mul_itch, and this area is beyond the
scratch area that toom32_mul asks for.

Regards,
/Niels