Allocation for toom33

Sun Oct 25 11:32:45 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

  Lets look at one of the more extreme ones:

  >  an  bn    s   t    n
  [...]
  > 142  98:  46 + 2 <  48 + 5

  So with toom33, the pointwise multiplications are: four 48 x 48, one
  46 x 2.

  For toom32 with the same inputs, n = 49, s = 44, t = 49, and we get
  three 49 x 49, one 49 x 44.

  With schoolbook for the pointwise multiplications we have

  Toom33: 4*48^2 + 46*2 = 9216 + 92 = 9308
  Toom32: 3*49^2 + 49*44 = 9457

  With Karatsuba recursing to schoolbook for the larger multiplies, we
  get instead

  Toom33: 12*24^2 + 46*2 = 6912 + 92 = 7004
  Toom32: 3*(2*25^2 + 25*24) + 2*25^2 + 24*19 = 7256

  So toom33 seems to win also in theory...

The next question is: how strong are the correlation between theory and
practice over a larger area an,bn?  I suspect it is weak, except wrt
asymptotics.  :-)

  Interesting. Trying to generalize, it seems like it's better to use
  toomMN with a very small multiplication at the infinity point, than to
  use toomM(N-1), if using the latter function implies the size of the
  balanced multiplications (i.e., n) get a limb larger.

Doesn't that also depend on whether the size is divisible by the
underlying toom primitive's splitting factor (such as even for toom22,
divisible by 3 for toom33)?

I suppose it would be interesting to plot--in projected 3D--the
behaviour of each toomMN primitive, and use that to identify quirks.  My
experience with GMP optimisation is that such quirks can hide much of
the execution time, and fixing them might be a great speedup.

And the next question is: Can we realistically take this into account?
Around the listed toom33 problem points toom32 rules.  We need to
simplify the geometry of the areas, or the toom selection code will be
more time consuming than the actual multiplication.  :-)

If we could actually model the execution time from some simple
parameters such as addmul_1 and addmul_2 times as function of n, and
mul_basecase time as function of m and n, a little as you start above,
then that might be very useful.  The algo selction formulas might not be
pretty, but from exact formulas we might be able to ask the optimisation
(in the maths sense) people for approximate simpler formulas.

  The intriguing thing is that the comment promises significantly less
  scratch space than is currently used (and less even if we would
  require s+t >= n + 5). Maybe it's just wrong.

Probably.  Or we've pessimised memory usage since we wrote it.  :-)

  > It looks like mpn_mul_n already doesn't use the old functions, but
  > mpn_sqr_n does, though.

  You'r right. But it doesn't use the new itch functions:

    else if (BELOW_THRESHOLD (n, MUL_TOOM3_THRESHOLD))
      {
        /* Allocate workspace of fixed size on stack: fast! */
        mp_limb_t ws[MPN_KARA_MUL_N_TSIZE (MUL_TOOM3_THRESHOLD_LIMIT-1)];
        ASSERT (MUL_TOOM3_THRESHOLD <= MUL_TOOM3_THRESHOLD_LIMIT);
        mpn_toom22_mul (p, a, n, b, n, ws);
      }
    else if (BELOW_THRESHOLD (n, MUL_TOOM44_THRESHOLD))
      {
        mp_ptr ws;
        TMP_SDECL;
        TMP_SMARK;
        ws = TMP_SALLOC_LIMBS (MPN_TOOM3_MUL_N_TSIZE (n));
        mpn_toom33_mul (p, a, n, b, n, ws);
        TMP_SFREE;
      }

  MPN_TOOM3_MUL_N_TSIZE can be replaced by the current itch function
  right away, but MPN_KARA_MUL_N_TSIZE probably cannot (I wouldn't
  expect a a call to an inline function to qualify as a constant
  expression, but I'm not sure how it works).

No, it is hardly a constant in a syntactic sense, which is what is
required here.

  Should (some) itch functions be macros instead?

Maybe.  I haven't thought about that.  It only really helps (I think)
when the arguments are constants.  Perhaps that's an important case.  A
problem might be that (portable) macros will require hairy code, without
any temporaries like n.

  Or should we itchify mul_n (then that particular problem disappears)?

We might want an itchifies mpn_mul_n at some point, but we cannot touch
mpn_mul_n's interface, since it is public.

  > To me, the most important allocation improvement of the toom functions
  > is to avoid TMP_ALLOC.  They already accept a scratch parameter, and
  > should ask for enough that TMP_ALLOC is never needed.

  Ok, it should be easy to increase the size claimed by
  mpn_toom33_mul_itch by n+1 (and then hope that it still works with
  current callers that bypass the itch function).

Those naughty callers deserve to have their wrists slapped!

-- 
Torbjörn