Allocation for toom33

Sat Oct 24 16:44:08 CEST 2009

I've been looking a little att toom33. There's a single
TMP_ALLOC_LIMBS, allocating space for one of the evaluated values,
which I'd would like to get rid off.

This value can be moved to the product area under the assumption that
s + t >= n + 5. It would make some sense to require that of the
inputs. But unlike the requirement s + t >= 5 which I've added to
other toom functions without much hesitation, s + t >= n + 5 does not
only rule out small sizes, it rules out the borderline cases for all
sizes; inputs of size 3*n, 2*n+1 (i.e., s = n, t = 1) would be
invalid for any n.

I still think that would make sense. E.g., for sizes 3n, 2n+1, we have
3n = 3(n+1) - 3 and 2n+1 = 2(n+1) - 1, this would be a nice fit for
toom32 with n' = n+1.

But when looking at the toom3 code, I have some other questions:

1. The comment in this itch function,

   static inline mp_size_t
   mpn_toom33_mul_itch (mp_size_t an, mp_size_t bn)
   {
     /* We could trim this to 4n+3 if HAVE_NATIVE_mpn_sublsh1_n, since
        mpn_toom_interpolate_5pts only needs scratch otherwise.  */
     mp_size_t n = (an + 2) / (size_t) 3;
     return 6 * n + GMP_NUMB_BITS;
   }

   As far as I can see, mpn_toom_interpolate_5pts needs no scratch space
   at all. But cutting down the space to 4 an / 3 + c log an would be
   nice, if possible.

2. The file mul_n.c contains it's own toom3 evaluation code,
   mpn_toom3_mul_n. It is claimed in its comments to need 2n + c log n
   scratch space, which is the same as what mpn_toom33_mul uses (but
   without ant TMP_ALLOC). But the corresponding macro
   MPN_TOOM3_MUL_N_TSIZE doesn't agree, it adds some extra term if
   HAVE_NATIVE_mpn_sublsh1_n is missing.

   Bu definition, this function rules out unbalanced inputs, but I'm
   not sure that's the only reason it can use less storage. IT also
   uses a different sequence of evaluation and poitwise
   multiplication. In particular, v0 and vinf are done last (which is
   the "obvious" way of doing it, if one ignores the way vinf[0] is
   passed separately to the interpolation function). That leaves the
   space at the low end of the product area free for temporaries for
   all computation involving the other points.

I think one should (i) eliminate the separate evaluation code in mul_n
(there's a small gain from knowing that s == t, but I doubt that's
measurable), and (ii) improve mpn_toom33_mul to really use only 2 an +
c log n temporary storage (or preferably less, of course). If
necessary, restrict inputs to sizes such that s+t>=n+5. (Similar
restrictions for other toom functions would probably reduce the need
for temporary storage. And most or all of the interpolation functions
have special case code to handle s + t <= n, which could also be
eliminated).

Comments?

Regards,
/Niels