gmpxx and allocation

Tue Mar 6 01:07:04 CET 2012

Hello,

just showing some numbers on a very silly example:

#include <gmpxx.h>

int main(){
   mpz_class a=1;
   for(int i=0;i<5000000;i++){
     mpz_class b=2;
     a=a+b-a*b;
   }
   return a.get_si()-1;
}

With gmp-5.0.4, it takes 1.10s.
With gmp-5.1, it takes 0.77s.
With a pool of objects, it takes 0.26s.
With a small object optimization, it takes 0.11s.

Discussion:

* gmp-5.1 uses fewer temporaries than 5.0 (just one less here, I didn't 
take a particularly favorable case). It does in-place operations as much 
as possible, which saves some allocations. gmp doesn't always like that, 
and sometimes this will result in a copy (say in mpz_mul). I doubt there 
is ever a significant loss compared to 5.0. If needed (for instance if we 
use a pool that makes creating temporaries very cheap), we could specify 
for each operation whether it likes being in-place (mpz_neg, mpz_add) or 
not (mpz_sqrt, mpz_mul).

It is becoming hard to use even fewer temporaries. One possibility is 
resurrecting the ternary operators (calling mpz_addmul); it uses a 
temporary internally, but that's infinitely cheaper than a temporary at 
the C++ level. Another possibility is merging temporaries inside an 
expression: z=a*b+c*d+e*f is currently computed as tmp1=e*f, tmp2=c*d, 
z=a*b, z+=tmp2, z+=tmp1. It could be done as tmp=c*d, z=a*b, z+=tmp, 
tmp=e*f, z+=tmp, but that requires checking that z doesn't alias e or f 
(and it is quite a bit of work). Using alloca for C++ temporaries would be 
inconvenient (though probably not that hard), and it doesn't feel like the 
right answer.

* For the pool of objects experiment, I have a linked list (could be 
vectors, but a list is more compatible with garbage collection) of 
initialized mpz_t objects, mpz_class contains just a mpz_ptr, when I 
construct a mpz_class I pop an element from the list (or create a new one 
if it is empty) and the destructor puts it back in the list. So mpz_init 
is only called 3 times for the whole program above.

With a pool, objects are never destructed (unless you explicitly purge the 
pool). Making this thread-safe is complicated: if there is a single pool, 
we need to synchronize access to it, which is slow and if we have one pool 
per thread, we need some working TLS and a way to clear the pool at thread 
exit, all doable in C++11 but not implemented in g++ yet. Since an 
mpz_class just contains an mpz_ptr, it is not possible to have 
mpq_class::get_num and get_den return references to an mpz_class as they 
currently do. They would have to return either a copy, or a mpz_class that 
doesn't own its mpz_t (we would need to store somewhere the information 
that the destructor should do nothing), or some proxy (there are traces of 
a mpz_classref in the ChangeLog), but none of those would be 
source-compatible with the current get_num.

Note that it would be possible to keep the current mp*_class and only use 
a pool internally for the temporary objects created during the evaluation 
of complex expressions.

This optimization and the previous one should have no benefit for 
optimally-written code, where the user takes care to create temporaries 
himself and reuses them (the pool even slows things down a little bit).

* For the small object optimization, I used flint's fmpz_t. It computes 
with integers as long as numbers fit in a pointer minus a few bits and 
only uses gmp when they don't (and in that case uses a pool approach). 
Note that this is a classical technique (some standard libraries use it 
for std::string, integers in guile (scheme) are implemented that way), 
flint was just the first one I found for this test. If done properly, one 
can get a speed within a small factor of builtin operations as long as the 
numbers fit (although it is hard to convince the compiler to generate 
optimal code without writing the whole function in asm).

The benefits are entirely limited to those small numbers. As soon as 
numbers don't fit anymore, it just slows things down a little bit. With 
64-bit numbers (that don't fit), the runtime increases to 0.44s whereas 
all the others remain the same (flint's pool implementation is different 
from mine, so the difference between .44s and .26s should not be 
interpreted directly as the overhead of small object optimization). 
Depending on how things are handled, pointers may be mangled which can 
hurt garbage collection (or they may not, or it may not).

This technique subsumes lazy allocation (which in some sense is a small 
object optimization where "small" means 0). Lazy allocation is 
particularly interesting in C++11. First, it allows a constexpr noexcept 0 
(ok, maybe not so useful). But mostly, it lets move construction skip an 
unneeded mpz_init/mpz_free pair which makes it fast and noexcept.

There will probably be a bigint type in the next C++ standard library, and 
it is likely that all implementations will use a small number 
optimization. The standard might even make lazy initialization mandatory 
(if it requires noexcept on the default and move constructors).

I don't think any of this will make it into gmpxx, it looks better suited 
to different C++ wrappers (I didn't even mention reference counting), but 
it seemed interesting to post it and see if people had any comments.

-- 
Marc Glisse