Allocation for toom33
nisse at lysator.liu.se
Sun Oct 25 10:50:31 CET 2009
Torbjorn Granlund <tg at gmplib.org> writes:
> I checked the measured data to see if toom33 is anytime considered the
> best function and s + t >= n + 5 does not hold.
> The points look perfect for toom32, but my measurement code claims
> toom33 is faster. The 2nd best function is claimed to be toom43
> most of the time.
> This is strange. I suppose we should investigate it. One explanation
> is that I let mpn_mul be applied to the oo point (as I have mentioned a
> few times) during measurements. I have yet to implement a replacement
> mpn_mul in the measurement program, which uses the function the tables
Lets look at one of the more extreme ones:
> an bn s t n
> 142 98: 46 + 2 < 48 + 5
So with toom33, the pointwise multiplications are: four 48 x 48, one
46 x 2.
For toom32 with the same inputs, n = 49, s = 44, t = 49, and we get
three 49 x 49, one 49 x 44.
With schoolbook for the pointwise multiplications we have
Toom33: 4*48^2 + 46*2 = 9216 + 92 = 9308
Toom32: 3*49^2 + 49*44 = 9457
With Karatsuba recursing to schoolbook for the larger multiplies, we
Toom33: 12*24^2 + 46*2 = 6912 + 92 = 7004
Toom32: 3*(2*25^2 + 25*24) + 2*25^2 + 24*19 = 7256
So toom33 seems to win also in theory...
Interesting. Trying to generalize, it seems like it's better to use
toomMN with a very small multiplication at the infinity point, than to
use toomM(N-1), if using the latter function implies the size of the
balanced multiplications (i.e., n) get a limb larger.
And now I see that there's an even more extreme point on the list,
with t = 1:
> 115 79: 37 + 1 < 39 + 5
Toom33: 4*39^2 + 37 = 6084 + 37 = 6121
Toom32: 3*40^2 + 39*35 = 6165
Here, toom33 looks *slightly* better in theory, but I guess it has a
worse linear term.
> /* We could trim this to 4n+3 if HAVE_NATIVE_mpn_sublsh1_n, since
> mpn_toom_interpolate_5pts only needs scratch otherwise. */
> As far as I can see, mpn_toom_interpolate_5pts needs no scratch space
> at all. But cutting down the space to 4 an / 3 + c log an would be
> nice, if possible.
The intriguing thing is that the comment promises significantly less
scratch space than is currently used (and less even if we would
require s+t >= n + 5). Maybe it's just wrong.
> It looks like mpn_mul_n already doesn't use the old functions, but
> mpn_sqr_n does, though.
You'r right. But it doesn't use the new itch functions:
else if (BELOW_THRESHOLD (n, MUL_TOOM3_THRESHOLD))
/* Allocate workspace of fixed size on stack: fast! */
mp_limb_t ws[MPN_KARA_MUL_N_TSIZE (MUL_TOOM3_THRESHOLD_LIMIT-1)];
ASSERT (MUL_TOOM3_THRESHOLD <= MUL_TOOM3_THRESHOLD_LIMIT);
mpn_toom22_mul (p, a, n, b, n, ws);
else if (BELOW_THRESHOLD (n, MUL_TOOM44_THRESHOLD))
ws = TMP_SALLOC_LIMBS (MPN_TOOM3_MUL_N_TSIZE (n));
mpn_toom33_mul (p, a, n, b, n, ws);
MPN_TOOM3_MUL_N_TSIZE can be replaced by the current itch function
right away, but MPN_KARA_MUL_N_TSIZE probably cannot (I wouldn't
expect a a call to an inline function to qualify as a constant
expression, but I'm not sure how it works).
Should (some) itch functions be macros instead? Or should we itchify
mul_n (then that particular problem disappears)?
> To me, the most important allocation improvement of the toom functions
> is to avoid TMP_ALLOC. They already accept a scratch parameter, and
> should ask for enough that TMP_ALLOC is never needed.
Ok, it should be easy to increase the size claimed by
mpn_toom33_mul_itch by n+1 (and then hope that it still works with
current callers that bypass the itch function).
More information about the gmp-devel