Allocation for toom33

Sun Oct 25 15:55:03 CET 2009

bodrato at mail.dm.unipi.it writes:

  Hi,

  >   The intriguing thing is that the comment promises significantly less
  >   scratch space than is currently used (and less even if we would
  >   require s+t >= n + 5). Maybe it's just wrong.
  >
  > Probably.  Or we've pessimised memory usage since we wrote it.  :-)

  I believe that mpn_toom3_mul_n (the "old" balanced only function) has a
  better memory usage than the newer mpn_toom33_mul. This is not because of
  the added flexibility of the unbalanced function, but depend on the
  sequence of evaluations.

  toom33 compute at first all the evaluations:
  a(1), a(-1), a(2), b(1), b(-1), b(2);
  then the products:
  c(1) <-a(1)*b(1) ...

  The old toom3_mul_n interlaces the operations above, possibly overwriting
  a(1) when it computes a(2)... Maybe the comment comes from the old
  implementation... Maybe we should better merge the ideas of the two
  versions before removing the "old" one.

We should write a temp area optimiser.  Idea:

For the toom functions, we use many temporary values, with well-defined
sizes as functions of n, s, and t.

The values form a data dependency DAG, ending in the final result.  The
source nodes are the input polynomial coefficients (most n in size, one
s large and one t large).

Optimising memory means reusing temporary areas cleverly.  (sic)

Finding an optimal memory use seems be related to the topological
sorting problem.  But why should we try to be more clever than is needed
for finding an optimal solution?  Perhaps we should simply randomly pick
a valid execution order, let a chosen execution pick a random free
scratch area?

To chose a valid next execution, If we keep track if k positions in the
DAG where the execution has arrived.  Then generate a random number
1...k and--if that position has all of its input data ready--advance
that particular computation to the next node.

One need to define what a "random free scratch area is", since terrible
choices might make the algorithm ineffective.  I think one should define
each scratch area as n+1 in size, except for ones in the product area
which need to be n each with a final one s+t large at position (M+N-1)n.

This random walk is then repeated a few billion times, saving the best
walks.

Of course, making the thing non-deterministic without any randomness is
also possible.  One just have to enumerate the choices at each state,
and then walk through then one after another.  But in a large search
space, such a search might take too long before it finds any good
solutions; the search begins "in a corner" and might not get far from it
until one runs out of patience.

I think this would be a good master's thesis project in the CS field.

-- 
Torbjörn