Allocation for toom33

Sun Oct 25 10:50:31 CET 2009

Torbjorn Granlund <tg at gmplib.org> writes:

> I checked the measured data to see if toom33 is anytime considered the
> best function and s + t >= n + 5 does not hold.

> The points look perfect for toom32, but my measurement code claims
> toom33 is faster.  The 2nd best function is claimed to be toom43
> most of the time.
>
> This is strange.  I suppose we should investigate it.  One explanation
> is that I let mpn_mul be applied to the oo point (as I have mentioned a
> few times) during measurements.  I have yet to implement a replacement
> mpn_mul in the measurement program, which uses the function the tables
> suggest.

Lets look at one of the more extreme ones:

>  an  bn    s   t    n
[...]
> 142  98:  46 + 2 <  48 + 5

So with toom33, the pointwise multiplications are: four 48 x 48, one
46 x 2.

For toom32 with the same inputs, n = 49, s = 44, t = 49, and we get
three 49 x 49, one 49 x 44.

With schoolbook for the pointwise multiplications we have

Toom33: 4*48^2 + 46*2 = 9216 + 92 = 9308
Toom32: 3*49^2 + 49*44 = 9457

With Karatsuba recursing to schoolbook for the larger multiplies, we
get instead

Toom33: 12*24^2 + 46*2 = 6912 + 92 = 7004
Toom32: 3*(2*25^2 + 25*24) + 2*25^2 + 24*19 = 7256

So toom33 seems to win also in theory...

Interesting. Trying to generalize, it seems like it's better to use
toomMN with a very small multiplication at the infinity point, than to
use toomM(N-1), if using the latter function implies the size of the
balanced multiplications (i.e., n) get a limb larger.

And now I see that there's an even more extreme point on the list,
with t = 1:

> 115  79:  37 + 1 <  39 + 5

Toom33: 4*39^2 + 37 = 6084 + 37 = 6121
Toom32: 3*40^2 + 39*35 = 6165

Here, toom33 looks *slightly* better in theory, but I guess it has a
worse linear term.

>        /* We could trim this to 4n+3 if HAVE_NATIVE_mpn_sublsh1_n, since
>           mpn_toom_interpolate_5pts only needs scratch otherwise.  */
>    
>      As far as I can see, mpn_toom_interpolate_5pts needs no scratch space
>      at all. But cutting down the space to 4 an / 3 + c log an would be
>      nice, if possible.

The intriguing thing is that the comment promises significantly less
scratch space than is currently used (and less even if we would
require s+t >= n + 5). Maybe it's just wrong.

> It looks like mpn_mul_n already doesn't use the old functions, but
> mpn_sqr_n does, though.

You'r right. But it doesn't use the new itch functions:

  else if (BELOW_THRESHOLD (n, MUL_TOOM3_THRESHOLD))
    {
      /* Allocate workspace of fixed size on stack: fast! */
      mp_limb_t ws[MPN_KARA_MUL_N_TSIZE (MUL_TOOM3_THRESHOLD_LIMIT-1)];
      ASSERT (MUL_TOOM3_THRESHOLD <= MUL_TOOM3_THRESHOLD_LIMIT);
      mpn_toom22_mul (p, a, n, b, n, ws);
    }
  else if (BELOW_THRESHOLD (n, MUL_TOOM44_THRESHOLD))
    {
      mp_ptr ws;
      TMP_SDECL;
      TMP_SMARK;
      ws = TMP_SALLOC_LIMBS (MPN_TOOM3_MUL_N_TSIZE (n));
      mpn_toom33_mul (p, a, n, b, n, ws);
      TMP_SFREE;
    }

MPN_TOOM3_MUL_N_TSIZE can be replaced by the current itch function
right away, but MPN_KARA_MUL_N_TSIZE probably cannot (I wouldn't
expect a a call to an inline function to qualify as a constant
expression, but I'm not sure how it works).

Should (some) itch functions be macros instead? Or should we itchify
mul_n (then that particular problem disappears)?

> To me, the most important allocation improvement of the toom functions
> is to avoid TMP_ALLOC.  They already accept a scratch parameter, and
> should ask for enough that TMP_ALLOC is never needed.

Ok, it should be easy to increase the size claimed by
mpn_toom33_mul_itch by n+1 (and then hope that it still works with
current callers that bypass the itch function).

Regards,
/Niels