Allocation for toom33

Sun Oct 25 20:44:08 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <tg at gmplib.org> writes:

  > nisse at lysator.liu.se (Niels Möller) writes:
  >
  >   One other observation: They all have large s and small t. Is there any
  >   input sizes where it's optimal to choose a toom varaiant that results
  >   in a small s?
  >   
  > For toom32 I get these points with small s:
  >
  >  an  bn    s  t   n
  >  26  22:   4  11  11
  >  27  24:   3  12  12
  >  29  26:   3  13  13
  > 100  95:   4  47  48
  > 100  96:   4  48  48

  Interesting. For these cases, using toom33 instead would give a
  smaller n. E.g., consider the last one, 100x96.

  With toom32, it's three 48x48 and one 48x4.

  With toom33, we would get n = 34, s = 32, t = 28, hence four 34x34 and
  one 32x28.

  With schoolbook, it's 7104 for toom32 and 5520 for toom33, which looks
  like a pretty large difference to me. If we take into account that
  Karatsuba is used for the recursive calls, I get 5376 for toom32 and
  4172 for toom33. So more than 1000 additional limb multiplications for
  toom32.

  So if toom32 wins despite this, it must be due to the additional
  linear cost of toom33. Interpolation alone is some 14 or so linear operations
  on 69-limb numbers for toom33 (including one divby3), and 7 or so for
  toom32, working on 96-limb numbers, and evaluation is slightly more
  expensive too.

Linear costs plus constant costs related to e.g. function call overhead.

  Anyway, it looks like we can't rule out small s, that case seem to be
  important for performance at least for sizes up to a few hundred
  limbs.

I am not so sure I agree with this conclusion.  We only see this
phenomenon for isolated operand sizes.

I analysed the collected data somewehat more now: For the Core i7, 632
of 500500 possible operand sizes <= 1000 have s or t <= 3 for the best
toom.  Of these 500500 direct toom were used in 348541 cases (the rest
was schoolbook, fft, or partial toom).  The average edge for these 632
cases compared to the 2nd best algorithm was 1.1%.

For K10 the numbers are 727 of 500500 possible, direct toom were used in
341298 cases, avg speedup 1.4%.

>From these numbers I'd say: Let's not handle tiny s or t. If we make the
code 0.01% faster over the entire range by not handling that, there will
be an average speedup of GMP multiply!

You're welcome to check/run my code to make sure I didn't make a mistake
with my quick hack.

-- 
Torbjörn