Allocation for toom33

Sun Oct 25 00:23:17 CEST 2009

nisse at lysator.liu.se (Niels Möller) writes:

  I've been looking a little att toom33. There's a single
  TMP_ALLOC_LIMBS, allocating space for one of the evaluated values,
  which I'd would like to get rid off.

Good!

  This value can be moved to the product area under the assumption that
  s + t >= n + 5. It would make some sense to require that of the
  inputs. But unlike the requirement s + t >= 5 which I've added to
  other toom functions without much hesitation, s + t >= n + 5 does not
  only rule out small sizes, it rules out the borderline cases for all
  sizes; inputs of size 3*n, 2*n+1 (i.e., s = n, t = 1) would be
  invalid for any n.

It makes sense to mee too.

I checked the measured data to see if toom33 is anytime considered the
best function and s + t >= n + 5 does not hold.  Here are the points for
a core i7:

 an  bn    s   t    n
 92  65:  30 + 3 <  31 + 5
 93  65:  31 + 3 <  31 + 5
 93  66:  31 + 4 <  31 + 5
 94  67:  30 + 3 <  32 + 5
 95  67:  31 + 3 <  32 + 5
 96  67:  32 + 3 <  32 + 5
 97  69:  31 + 3 <  33 + 5
 98  68:  32 + 2 <  33 + 5
 98  69:  32 + 3 <  33 + 5
 99  69:  33 + 3 <  33 + 5
101  70:  33 + 2 <  34 + 5
101  71:  33 + 3 <  34 + 5
102  71:  34 + 3 <  34 + 5
103  71:  33 + 1 <  35 + 5
103  73:  33 + 3 <  35 + 5
103  74:  33 + 4 <  35 + 5
104  73:  34 + 3 <  35 + 5
104  74:  34 + 4 <  35 + 5
104  75:  34 + 5 <  35 + 5
105  73:  35 + 3 <  35 + 5
105  74:  35 + 4 <  35 + 5
107  75:  35 + 3 <  36 + 5
109  77:  35 + 3 <  37 + 5
109  78:  35 + 4 <  37 + 5
110  76:  36 + 2 <  37 + 5
110  77:  36 + 3 <  37 + 5
110  78:  36 + 4 <  37 + 5
110  79:  36 + 5 <  37 + 5
111  77:  37 + 3 <  37 + 5
111  78:  37 + 4 <  37 + 5
115  79:  37 + 1 <  39 + 5
115  81:  37 + 3 <  39 + 5
115  83:  37 + 5 <  39 + 5
116  81:  38 + 3 <  39 + 5
116  82:  38 + 4 <  39 + 5
116  83:  38 + 5 <  39 + 5
117  81:  39 + 3 <  39 + 5
117  82:  39 + 4 <  39 + 5
127  89:  41 + 3 <  43 + 5
127  90:  41 + 4 <  43 + 5
127  91:  41 + 5 <  43 + 5
128  89:  42 + 3 <  43 + 5
128  90:  42 + 4 <  43 + 5
128  91:  42 + 5 <  43 + 5
129  89:  43 + 3 <  43 + 5
129  90:  43 + 4 <  43 + 5
130  91:  42 + 3 <  44 + 5
131  91:  43 + 3 <  44 + 5
132  91:  44 + 3 <  44 + 5
133  93:  43 + 3 <  45 + 5
133  94:  43 + 4 <  45 + 5
134  93:  44 + 3 <  45 + 5
135  93:  45 + 3 <  45 + 5
135  94:  45 + 4 <  45 + 5
136  94:  44 + 2 <  46 + 5
136  97:  44 + 5 <  46 + 5
136  98:  44 + 6 <  46 + 5
137  97:  45 + 5 <  46 + 5
139  96:  45 + 2 <  47 + 5
139  97:  45 + 3 <  47 + 5
139  98:  45 + 4 <  47 + 5
139  99:  45 + 5 <  47 + 5
139 100:  45 + 6 <  47 + 5
140  97:  46 + 3 <  47 + 5
140  98:  46 + 4 <  47 + 5
140  99:  46 + 5 <  47 + 5
141  97:  47 + 3 <  47 + 5
141  98:  47 + 4 <  47 + 5
142  98:  46 + 2 <  48 + 5
142  99:  46 + 3 <  48 + 5
142 101:  46 + 5 <  48 + 5
143 100:  47 + 4 <  48 + 5
143 101:  47 + 5 <  48 + 5
144  99:  48 + 3 <  48 + 5
157 109:  51 + 3 <  53 + 5
160 114:  52 + 6 <  54 + 5
163 113:  53 + 3 <  55 + 5
164 113:  54 + 3 <  55 + 5
165 113:  55 + 3 <  55 + 5
165 114:  55 + 4 <  55 + 5
189 129:  63 + 3 <  63 + 5
189 130:  63 + 4 <  63 + 5
662 446: 220 + 4 < 221 + 5

There are similar points also for AMD K10.

  I still think that would make sense. E.g., for sizes 3n, 2n+1, we have
  3n = 3(n+1) - 3 and 2n+1 = 2(n+1) - 1, this would be a nice fit for
  toom32 with n' = n+1.

The points look perfect for toom32, but my measurement code claims
toom33 is faster.  The 2nd best function is claimed to be toom43
most of the time.

This is strange.  I suppose we should investigate it.  One explanation
is that I let mpn_mul be applied to the oo point (as I have mentioned a
few times) during measurements.  I have yet to implement a replacement
mpn_mul in the measurement program, which uses the function the tables
suggest.

  But when looking at the toom3 code, I have some other questions:

  1. The comment in this itch function,

     static inline mp_size_t
     mpn_toom33_mul_itch (mp_size_t an, mp_size_t bn)
     {
       /* We could trim this to 4n+3 if HAVE_NATIVE_mpn_sublsh1_n, since
          mpn_toom_interpolate_5pts only needs scratch otherwise.  */
       mp_size_t n = (an + 2) / (size_t) 3;
       return 6 * n + GMP_NUMB_BITS;
     }

     As far as I can see, mpn_toom_interpolate_5pts needs no scratch space
     at all. But cutting down the space to 4 an / 3 + c log an would be
     nice, if possible.

I think somebody (you?) improved mpn_toom_interpolate_5pts to not use
scratch.  That comment is presumably from before that.

  2. The file mul_n.c contains it's own toom3 evaluation code,
     mpn_toom3_mul_n. It is claimed in its comments to need 2n + c log n
     scratch space, which is the same as what mpn_toom33_mul uses (but
     without ant TMP_ALLOC). But the corresponding macro
     MPN_TOOM3_MUL_N_TSIZE doesn't agree, it adds some extra term if
     HAVE_NATIVE_mpn_sublsh1_n is missing.

The toom and kara code of mul_n should be removed.

It looks like mpn_mul_n already doesn't use the old functions, but
mpn_sqr_n does, though.

I think we should break out mpn_sqr_n to its own file, and leave just
mpn_mul_n in mul_n.c.

We should also document mpn_sqr_n, and recommend users to call that for
squaring instead of mpn_mul.  (Such a recommendation should be put at the
mpn_mul entry.)

  I think one should (i) eliminate the separate evaluation code in mul_n
  (there's a small gain from knowing that s == t, but I doubt that's
  measurable), and (ii) improve mpn_toom33_mul to really use only 2 an +
  c log n temporary storage (or preferably less, of course). If
  necessary, restrict inputs to sizes such that s+t>=n+5. (Similar
  restrictions for other toom functions would probably reduce the need
  for temporary storage. And most or all of the interpolation functions
  have special case code to handle s + t <= n, which could also be
  eliminated).

To me, the most important allocation improvement of the toom functions
is to avoid TMP_ALLOC.  They already accept a scratch parameter, and
should ask for enough that TMP_ALLOC is never needed.

The presence of TMP_ALLOC means the activation record pointer cannot
usually be eliminated by compilers, stealing a registers from our
variables.

The second most improvement allocation improvement is to make the code
as spatially local as possible.  Reducing allocation is one way, using
allocated storage wisely is another way.  Small allocation size
improvements for these functions will probably not be noticeable.

-- 
Torbjörn