Allocation for toom33
Torbjorn Granlund
tg at gmplib.org
Sun Oct 25 00:23:17 CEST 2009
nisse at lysator.liu.se (Niels Möller) writes:
I've been looking a little att toom33. There's a single
TMP_ALLOC_LIMBS, allocating space for one of the evaluated values,
which I'd would like to get rid off.
Good!
This value can be moved to the product area under the assumption that
s + t >= n + 5. It would make some sense to require that of the
inputs. But unlike the requirement s + t >= 5 which I've added to
other toom functions without much hesitation, s + t >= n + 5 does not
only rule out small sizes, it rules out the borderline cases for all
sizes; inputs of size 3*n, 2*n+1 (i.e., s = n, t = 1) would be
invalid for any n.
It makes sense to mee too.
I checked the measured data to see if toom33 is anytime considered the
best function and s + t >= n + 5 does not hold. Here are the points for
a core i7:
an bn s t n
92 65: 30 + 3 < 31 + 5
93 65: 31 + 3 < 31 + 5
93 66: 31 + 4 < 31 + 5
94 67: 30 + 3 < 32 + 5
95 67: 31 + 3 < 32 + 5
96 67: 32 + 3 < 32 + 5
97 69: 31 + 3 < 33 + 5
98 68: 32 + 2 < 33 + 5
98 69: 32 + 3 < 33 + 5
99 69: 33 + 3 < 33 + 5
101 70: 33 + 2 < 34 + 5
101 71: 33 + 3 < 34 + 5
102 71: 34 + 3 < 34 + 5
103 71: 33 + 1 < 35 + 5
103 73: 33 + 3 < 35 + 5
103 74: 33 + 4 < 35 + 5
104 73: 34 + 3 < 35 + 5
104 74: 34 + 4 < 35 + 5
104 75: 34 + 5 < 35 + 5
105 73: 35 + 3 < 35 + 5
105 74: 35 + 4 < 35 + 5
107 75: 35 + 3 < 36 + 5
109 77: 35 + 3 < 37 + 5
109 78: 35 + 4 < 37 + 5
110 76: 36 + 2 < 37 + 5
110 77: 36 + 3 < 37 + 5
110 78: 36 + 4 < 37 + 5
110 79: 36 + 5 < 37 + 5
111 77: 37 + 3 < 37 + 5
111 78: 37 + 4 < 37 + 5
115 79: 37 + 1 < 39 + 5
115 81: 37 + 3 < 39 + 5
115 83: 37 + 5 < 39 + 5
116 81: 38 + 3 < 39 + 5
116 82: 38 + 4 < 39 + 5
116 83: 38 + 5 < 39 + 5
117 81: 39 + 3 < 39 + 5
117 82: 39 + 4 < 39 + 5
127 89: 41 + 3 < 43 + 5
127 90: 41 + 4 < 43 + 5
127 91: 41 + 5 < 43 + 5
128 89: 42 + 3 < 43 + 5
128 90: 42 + 4 < 43 + 5
128 91: 42 + 5 < 43 + 5
129 89: 43 + 3 < 43 + 5
129 90: 43 + 4 < 43 + 5
130 91: 42 + 3 < 44 + 5
131 91: 43 + 3 < 44 + 5
132 91: 44 + 3 < 44 + 5
133 93: 43 + 3 < 45 + 5
133 94: 43 + 4 < 45 + 5
134 93: 44 + 3 < 45 + 5
135 93: 45 + 3 < 45 + 5
135 94: 45 + 4 < 45 + 5
136 94: 44 + 2 < 46 + 5
136 97: 44 + 5 < 46 + 5
136 98: 44 + 6 < 46 + 5
137 97: 45 + 5 < 46 + 5
139 96: 45 + 2 < 47 + 5
139 97: 45 + 3 < 47 + 5
139 98: 45 + 4 < 47 + 5
139 99: 45 + 5 < 47 + 5
139 100: 45 + 6 < 47 + 5
140 97: 46 + 3 < 47 + 5
140 98: 46 + 4 < 47 + 5
140 99: 46 + 5 < 47 + 5
141 97: 47 + 3 < 47 + 5
141 98: 47 + 4 < 47 + 5
142 98: 46 + 2 < 48 + 5
142 99: 46 + 3 < 48 + 5
142 101: 46 + 5 < 48 + 5
143 100: 47 + 4 < 48 + 5
143 101: 47 + 5 < 48 + 5
144 99: 48 + 3 < 48 + 5
157 109: 51 + 3 < 53 + 5
160 114: 52 + 6 < 54 + 5
163 113: 53 + 3 < 55 + 5
164 113: 54 + 3 < 55 + 5
165 113: 55 + 3 < 55 + 5
165 114: 55 + 4 < 55 + 5
189 129: 63 + 3 < 63 + 5
189 130: 63 + 4 < 63 + 5
662 446: 220 + 4 < 221 + 5
There are similar points also for AMD K10.
I still think that would make sense. E.g., for sizes 3n, 2n+1, we have
3n = 3(n+1) - 3 and 2n+1 = 2(n+1) - 1, this would be a nice fit for
toom32 with n' = n+1.
The points look perfect for toom32, but my measurement code claims
toom33 is faster. The 2nd best function is claimed to be toom43
most of the time.
This is strange. I suppose we should investigate it. One explanation
is that I let mpn_mul be applied to the oo point (as I have mentioned a
few times) during measurements. I have yet to implement a replacement
mpn_mul in the measurement program, which uses the function the tables
suggest.
But when looking at the toom3 code, I have some other questions:
1. The comment in this itch function,
static inline mp_size_t
mpn_toom33_mul_itch (mp_size_t an, mp_size_t bn)
{
/* We could trim this to 4n+3 if HAVE_NATIVE_mpn_sublsh1_n, since
mpn_toom_interpolate_5pts only needs scratch otherwise. */
mp_size_t n = (an + 2) / (size_t) 3;
return 6 * n + GMP_NUMB_BITS;
}
As far as I can see, mpn_toom_interpolate_5pts needs no scratch space
at all. But cutting down the space to 4 an / 3 + c log an would be
nice, if possible.
I think somebody (you?) improved mpn_toom_interpolate_5pts to not use
scratch. That comment is presumably from before that.
2. The file mul_n.c contains it's own toom3 evaluation code,
mpn_toom3_mul_n. It is claimed in its comments to need 2n + c log n
scratch space, which is the same as what mpn_toom33_mul uses (but
without ant TMP_ALLOC). But the corresponding macro
MPN_TOOM3_MUL_N_TSIZE doesn't agree, it adds some extra term if
HAVE_NATIVE_mpn_sublsh1_n is missing.
The toom and kara code of mul_n should be removed.
It looks like mpn_mul_n already doesn't use the old functions, but
mpn_sqr_n does, though.
I think we should break out mpn_sqr_n to its own file, and leave just
mpn_mul_n in mul_n.c.
We should also document mpn_sqr_n, and recommend users to call that for
squaring instead of mpn_mul. (Such a recommendation should be put at the
mpn_mul entry.)
I think one should (i) eliminate the separate evaluation code in mul_n
(there's a small gain from knowing that s == t, but I doubt that's
measurable), and (ii) improve mpn_toom33_mul to really use only 2 an +
c log n temporary storage (or preferably less, of course). If
necessary, restrict inputs to sizes such that s+t>=n+5. (Similar
restrictions for other toom functions would probably reduce the need
for temporary storage. And most or all of the interpolation functions
have special case code to handle s + t <= n, which could also be
eliminated).
To me, the most important allocation improvement of the toom functions
is to avoid TMP_ALLOC. They already accept a scratch parameter, and
should ask for enough that TMP_ALLOC is never needed.
The presence of TMP_ALLOC means the activation record pointer cannot
usually be eliminated by compilers, stealing a registers from our
variables.
The second most improvement allocation improvement is to make the code
as spatially local as possible. Reducing allocation is one way, using
allocated storage wisely is another way. Small allocation size
improvements for these functions will probably not be noticeable.
--
Torbjörn
More information about the gmp-devel
mailing list