gmpxx and allocation
Marc Glisse
marc.glisse at inria.fr
Tue Mar 6 01:07:04 CET 2012
Hello,
just showing some numbers on a very silly example:
#include <gmpxx.h>
int main(){
mpz_class a=1;
for(int i=0;i<5000000;i++){
mpz_class b=2;
a=a+b-a*b;
}
return a.get_si()-1;
}
With gmp-5.0.4, it takes 1.10s.
With gmp-5.1, it takes 0.77s.
With a pool of objects, it takes 0.26s.
With a small object optimization, it takes 0.11s.
Discussion:
* gmp-5.1 uses fewer temporaries than 5.0 (just one less here, I didn't
take a particularly favorable case). It does in-place operations as much
as possible, which saves some allocations. gmp doesn't always like that,
and sometimes this will result in a copy (say in mpz_mul). I doubt there
is ever a significant loss compared to 5.0. If needed (for instance if we
use a pool that makes creating temporaries very cheap), we could specify
for each operation whether it likes being in-place (mpz_neg, mpz_add) or
not (mpz_sqrt, mpz_mul).
It is becoming hard to use even fewer temporaries. One possibility is
resurrecting the ternary operators (calling mpz_addmul); it uses a
temporary internally, but that's infinitely cheaper than a temporary at
the C++ level. Another possibility is merging temporaries inside an
expression: z=a*b+c*d+e*f is currently computed as tmp1=e*f, tmp2=c*d,
z=a*b, z+=tmp2, z+=tmp1. It could be done as tmp=c*d, z=a*b, z+=tmp,
tmp=e*f, z+=tmp, but that requires checking that z doesn't alias e or f
(and it is quite a bit of work). Using alloca for C++ temporaries would be
inconvenient (though probably not that hard), and it doesn't feel like the
right answer.
* For the pool of objects experiment, I have a linked list (could be
vectors, but a list is more compatible with garbage collection) of
initialized mpz_t objects, mpz_class contains just a mpz_ptr, when I
construct a mpz_class I pop an element from the list (or create a new one
if it is empty) and the destructor puts it back in the list. So mpz_init
is only called 3 times for the whole program above.
With a pool, objects are never destructed (unless you explicitly purge the
pool). Making this thread-safe is complicated: if there is a single pool,
we need to synchronize access to it, which is slow and if we have one pool
per thread, we need some working TLS and a way to clear the pool at thread
exit, all doable in C++11 but not implemented in g++ yet. Since an
mpz_class just contains an mpz_ptr, it is not possible to have
mpq_class::get_num and get_den return references to an mpz_class as they
currently do. They would have to return either a copy, or a mpz_class that
doesn't own its mpz_t (we would need to store somewhere the information
that the destructor should do nothing), or some proxy (there are traces of
a mpz_classref in the ChangeLog), but none of those would be
source-compatible with the current get_num.
Note that it would be possible to keep the current mp*_class and only use
a pool internally for the temporary objects created during the evaluation
of complex expressions.
This optimization and the previous one should have no benefit for
optimally-written code, where the user takes care to create temporaries
himself and reuses them (the pool even slows things down a little bit).
* For the small object optimization, I used flint's fmpz_t. It computes
with integers as long as numbers fit in a pointer minus a few bits and
only uses gmp when they don't (and in that case uses a pool approach).
Note that this is a classical technique (some standard libraries use it
for std::string, integers in guile (scheme) are implemented that way),
flint was just the first one I found for this test. If done properly, one
can get a speed within a small factor of builtin operations as long as the
numbers fit (although it is hard to convince the compiler to generate
optimal code without writing the whole function in asm).
The benefits are entirely limited to those small numbers. As soon as
numbers don't fit anymore, it just slows things down a little bit. With
64-bit numbers (that don't fit), the runtime increases to 0.44s whereas
all the others remain the same (flint's pool implementation is different
from mine, so the difference between .44s and .26s should not be
interpreted directly as the overhead of small object optimization).
Depending on how things are handled, pointers may be mangled which can
hurt garbage collection (or they may not, or it may not).
This technique subsumes lazy allocation (which in some sense is a small
object optimization where "small" means 0). Lazy allocation is
particularly interesting in C++11. First, it allows a constexpr noexcept 0
(ok, maybe not so useful). But mostly, it lets move construction skip an
unneeded mpz_init/mpz_free pair which makes it fast and noexcept.
There will probably be a bigint type in the next C++ standard library, and
it is likely that all implementations will use a small number
optimization. The standard might even make lazy initialization mandatory
(if it requires noexcept on the default and move constructors).
I don't think any of this will make it into gmpxx, it looks better suited
to different C++ wrappers (I didn't even mention reference counting), but
it seemed interesting to post it and see if people had any comments.
--
Marc Glisse
More information about the gmp-discuss
mailing list