Additional memory handler features.

Wed Dec 31 08:46:55 UTC 2014

> What I asked was what problem is solved by finer granularity
> than memory with and without pointers. I kind-of dislike adding very
> gmp-specific arguments to the allocation functions, so I'd be happy to
> see a *concrete* use case where it solves a real problem.

First of all, there are few things that are already as GMP-specific
as GMP's "custom memory allocation" function pointer interfaces.
Saying you don't like to add gmp-specific arguments to interfaces
that are already completely GMP-specific is a total non-sequitur.

I think the use case has already been made, at least in general terms.

GMP currently provides a very primitive capability allowing applications
that use GMP to provide their own, simple, malloc/free/realloc-like
functions via which GMP performs (hopefully) *all* of its dynamic
memory allocation.  GMP does not require, and does not make any
assumptions about the semantics of these custom memory interfaces
beyond those that reasonable, standard-conforming implementations of
malloc/free/realloc provide.

But that does not mean certain demanding applications that use GMP
do not have assumptions of their own that they want or need to
enforce regarding memory management.  Memory may be allocated
from different places in the virtual address space, or with
different alignment requirements, by different threads, perhaps
shared between certain thread subsets, or any number of other
conceivable attributes depending upon what kind of data the memory
will contain, or usage pattern that might be expected from it.
I have previously delineated several reasonable categories of object
types that GMP creates/manipulates, and that sophisticated
memory managers might need to know and can usefully exploit.

> I believe this design was mistake, and it is under
> reconsideration. See
> https://gmplib.org/list-archives/gmp-devel/2012-January/002161.html

I disagree.  For custom memory allocation systems such as I am working with,
requiring that the exact same size be passed to all three functions (for a
given block) is both a reasonable requirement and can also be *usefully*
exploited, e.g., for performance benefits.

In my opinion, since it is apparently either inefficient or technically
infeasible for mpz_sizeinbase() to always return the exact answer, then I would
abandon its use within mpz_out_str (and its cohorts), and instead write the ASCII
data into a linked-list of temporary structures, each containing a fixed-sized
array of characters, and a "next" pointer.  The first of these buffers could be
stack-allocated, and additional nodes allocated dynamically as needed.  Once the
conversion is complete, then a correctly-sized buffer can be allocated, and the
final result copied into the final linear buffer.  This final copy operation
also allows you to take care of the fact that the printable representation is
most efficiently generated least-significant-digit(s) first -- i.e., one can
generate digits-groups into this buffer in the natural, reverse order, and
correct this order during the final copy step.  (These temporary buffers
would be "non-limb / stack temporary" type of objects.)  Better yet, fill the
buffer not with ASCII characters, but with a sequence of "base^k" digits,
where base is the caller's choice of base, and k is selected to produce digits
<= a limb in size.  This results in a significantly smaller temporary buffer,
and the conversion to individual ASCII digits would be made during the final
"right-sized" copying/reversing phase.

> Right, it's desirable that the gc is made aware of all mpz_t objects.
> But it's not obvious that additional arguments to the allocation
> function is the right way to arrange that.

It is not just a question of making GC be aware of these objects.  It is
also a question of object *placement*!  Which portion of the virtual
address space shall the requested object be allocated within?  As I have
already described, the choice quite reasonably depends upon the type of
object being requested, which quite naturally divides into (at least) the
following four distinct classes:

	- limb arrays for GMP-internal stack temporary buffers
	- limb arrays for permanent mpx_t objects
	- other sorts of stack temporary buffers (e.g., the obstack-like
	  output conversion buffers mentioned above)
	- other sorts of permanent memory (e.g., random number state
	  objects, or strings allocated dynamically by a (perhaps) as yet
	  non-existent "asprintf()-like" version of mpz_out_str() that
	  returns a dynamically allocated string instead of scribbling
	  the result into a "FILE *" stream.

Placement must be decided by the alloc() or realloc() function, which must
know the type in order to correctly decide placement of the new block.

Proper placement can be a question of correctness (maintaining application-
specific memory management invariants), performance (cache and other locality
issues), or both.

A:  When manipulating, e.g., large arrays of mpz_t objects, locality issues can
make just as big a performance improvement as the performance "hit" produced
if GMP were to automatically "down-size" limb arrays in the "bounded memory"
example previously mentioned.  Consider a large array of mpz_t objects that
contain very small (single limb) values.  If my memory manager can arrange
for all of these limb arrays to occupy consecutive memory locations, then
considerable cache locality is gained, and the performance impact of using
GMP to manipulate these numbers (instead of simple arrays of C longs, which
can suffer arithmetic overflow) is significantly reduced.  Furthermore, in
the event that several of these array elements need to be up-sized to larger
limb arrays, then a suitable garbage collector can occasionally re-copy these
limb arrays back into consecutively allocated memory, thereby restoring
locality that was lost by the resize operation.  Certain classes of
garbage collector would accomplish this simply as a fortunate by-product
of their overall method.

B: The down-sizing decision must necessarily be deferred to the application
itself.  In my case, my garbage collector could process a large set of
previously allocated/initialized mpx_t objects that my application has
marked as being "quiescent" (within its own data structures that the garbage
collector is aware of) and "right-size" all of the corresponding limb arrays.
This is a "stealing from Peter to pay Paul" technique, but used only when my
application has: (a) explicitly marked the "Peter" data structures as being
"quiescent", and (b) some *permanent* mpx_t object needs to expands its limb
array.  The cost of this shuffling need only be incurred when virtual
memory is tight, or when consideration is being given to significantly
expand the virtual memory footprint.  But this decision is made by a
the custom "alloc" or "realloc" function when there is a need to expand
the limb array of some *permanent* mpx_t object.  It would be a mistake
to do this for alloc/realloc of memory that is either non-limb or a
stack temporary.  This technique is easy to do because of the one-to-one
relationship between an mpx_t object and its limb array -- only one
reference needs to be updated when moving a permanent limb array, and since
the various mpx_t objects in question are "quiescent", we know that GMP is
not currently attempting to manipulate them -- in any thread.

A and B are just two very specific use-cases for knowing the types of objects
that GMP is requesting.  "A" needs to know limb from non-limb, both for the
initial consecutive placement, and perhaps to expand a limb array into "nearby"
space, if possible.  "B" needs to discern permanent limb from all others.  I
have many other equally specific use-cases in mind.

> The cases I'm aware of are
> gmp bindings for languages with automatic mamory management (e.g., guile
> or pike). And there, the natural way to arrange that is to have wrappers
> around mpz_init and mpz_clear that registers all long-lived gmp objects
> with the gc.

This can allow GC-registration, but isn't enough for placement decisions,
or for correctly inferring the permanent/temporary status of all limb
arrays.  (For example, are mpz_init() and mpz_clear() ever called
internally by GMP to initialize/clear mpz_t objects that are actually
stack temporaries -- e.g., for the purpose of invoking other functions that
require mpz_t pointers?)  I can control and manage state of my memory
manager at the point of call for mpz_init() and mpz_clear() -- but only
for those that occur within my own application -- I cannot control said
state at places where GMP might call these functions internally.

> I agree a special stack allocator could have a place here (in
> particular, if for some reason running with a very small execution
> stack). And it could be done internally in gmp.

Is GMP providing a custom memory allocation facility or not?

Putting such a thing inside of GMP undermines this from the application
developer's point of view.  For applications that truly want to control
issues such as cache locality, and closely manage the memory resource,
this would be a big mistake.  Passing one more "int" to the memory
allocation interfaces permits all of this, and is easily ignored by
every application that doesn't care.

Also, if you build this into GMP, what happens when that stack-oriented
region for temporary limb arrays becomes too huge to manage?  It is the
application's memory manager that should decide, not GMP.  Reductio ad
absurdum.

> But also note that small allocations are typically done on the execution
> stack, with alloca, and that for larger allocations, the computational
> cost of computing the limbs stored there is vastly larger than the
> allocation overhead, making the latter negligible.

Yes, I am aware that small temp limb-arrays go directly onto the stack, and
that only larger ones get allocated "on the heap."  Once again, it is not
just a matter of the cost of allocation.  If you are working with big
numbers that are always going to require heap-allocated temporaries,
then a custom memory manager can know to "push" and "pop" these limb
arrays from a region that it keeps just for that purpose, thereby
gaining from the fact that this region is going to be glued to the
cache, and not off in some random part of the virtual address space each
time.  But this *placement* decision can only be correctly made if the
memory manager is *given* this information (temporary limb array) by GMP.

I assert that this information cannot be correctly "bolted on" from
outside of GMP, but must originate from inside of GMP.  Each provided
mpx_t function has the potential to make *sequences* of requests through
the custom memory management interfaces that cross over several of the
categories that I have listed above, and it is not reasonable to try to
"pre-compute" what sequence of said operations that GMP is going to
execute and to try and "prime" the memory manager to respond correctly
to that predicted sequence.  The "bolt it on outside of GMP" notion is
utterly ridiculous.

> But it's not obvious that additional arguments to the allocation
> function is the right way to arrange that.

Here is an alternate proposal that does not require any new arguments
to these function signatures.  Provide instead 4 *sets* of these 3
function pointers, one for each class of object.  The existing
mp_set_memory_functions() interface would set all 4 "sets" of pointers
at once, and the existing mp_get_memory_functions() routine would
return the pointers from one of these sets.  (Pick one -- I don't really
care which, but it should be documented and standardized.)  New mp_ functions
would permit more nuanced getting and setting of the 12 pointers.  It would
then be a matter of making each alloc/free/realloc call invoke the
function from the correct set.  This would enable everything I have
asked for with little (if any) significant impact to GMP's existing
performance.  (9 additional pointers added to the working-set, and
6 of these would probably be very seldomly used, thereby effectively
removing them from the working-set of in-cache objects.)  I would further
recommend that the new API functions perform getting and setting on a
per-set basis, with an integer argument indicating which of the (initially
four) sets is being get/set.  This allows for future expansion
if new classes of objects need to be introduced -- just add a new
#define to gmp.h that provides a symbolic name for the new set.

 From an implementation point of view, it might be useful to define a
small structure containing a single set of 3 function pointers.  In
certain contexts it might be useful to pass the address of the correct
structure into lower-level functions, so that they can invoke an alloc/
free/realloc function without having to know the class of object upon
which they are operating (e.g., temporary versus permanent).  The
caller that is choosing which of 4 structures to take the address of
knows which kind of object is being manipulated and can make the
correct choice.  The new little structure can also become a standard,
supported part of the API, and make the argument lists smaller on the
newly added get/set functions to support this.

One final, potential optimization: maintain only 4 pointers to these
structures (one pointer for each object type).  Have the new get/set
methods communicate only a single structure pointer.  Setting one of
these pointers back to NULL causes GMP's internal pointer to revert
back to the address of its own internal structure containing the
default implementation, and this pointer value queries as a NULL
pointer by the "get" function.  This makes it more efficient, e.g.,
to temporarily alter the handling of one object type.  Two possible
implementation choices: does GMP copy the caller's structure into its
own internal structure, or does it remember the address of the caller's
structure and always call indirectly through the caller's structure?
Either choice works for me.  (Temporary overrides in my own application
should be thread-local, and communicated directly to my memory manager
via other means.)

There is one more potential interaction between this discussion and the
one that I split it off from.  The memory handling function pointers
*could* be treated the same way as other proposed custom handling for
exceptions.  In other words, there is the possibility of using a
dynamically-scoped mechanism.  Under his scheme, any dynamically-scoped
memory handler is used instead of the global/statically-scoped pointers,
which are used only when no dynamically-scoped version is present.

For efficiency reasons, it would be helpful to have the chain of
memory handlers be a separate chain from that used for the exception
handlers.  This allows the memory handler chain to be NULL even in the
case where several libraries are vying for control of exceptions,
resulting in many nested exception handler contexts.  Having to trace
through a long chain of exception handler contexts (only to discover
that none of them contain a handler for the "alloc temp limb array"
operation) is inefficient, and should be avoided.  Repeating this
fruitless search on each attempted allocation is also to be avoided.
Using separate chains for each mechanism avoids these problems.

One good reason to reject this suggestion, however, is that memory is
a global resource that should arguably be managed globally by the
application (not under individual, possibly conflicting objectives
set by disparate libraries).  Handling of exceptional conditions
(including out-of-memory), however, is properly the concern of each
individual library, and so it makes sense to use a dynamically-
scoped mechanism for exception handling.  Memory management, perhaps
not.  Worth a brief discussion, however, for potential symmetry and
architectural purposes.

David

On 12/24/2014 04:02 PM, Niels Möller wrote:
> "David M. Warme" <David at Warme.net> writes:
>
>>> But I'm not ocnvinced this level of granularity makes sense. Which
>>> problem does it solve?
>> Limb arrays have the potential to receive special treatment by the
>> memory manager.
> Agreed. What I asked was what problem is solved by finer granularity
> than memory with and without pointers. I kind-of dislike adding very
> gmp-specific arguments to the allocation functions, so I'd be happy to
> see a *concrete* use case where it solves a real problem.
>
>> Good design on the part of the GMP development team
>> has resulted in the existing GMP custom memory management
>> function pointers that always explicitly pass the current size, so that
>> this implicit address-to-size mapping (and its corresponding overhead)
>> is unnecessary.
> I believe this design was mistake, and it is under reconsideration. See
> https://gmplib.org/list-archives/gmp-devel/2012-January/002161.html.
>
>> a memory
>> manager could also make use of the fact that there is one-and-only-
>> one permanent reference to each limb array -- namely by the mpz_t
>> or mpf_t object that "owns" the limb array.
> Right, it's desirable that the gc is made aware of all mpz_t objects.
> But it's not obvious that additional arguments to the allocation
> function is the right way to arrange that. The cases I'm aware of are
> gmp bindings for languages with automatic mamory management (e.g., guile
> or pike). And there, the natural way to arrange that is to have wrappers
> around mpz_init and mpz_clear that registers all long-lived gmp objects
> with the gc.
>
>> The other reason concerns memory lifetime.  The stack temporaries
>> tend to have LIFO behavior with respect to allocation and freeing.
>> This is another attribute that a memory manager could exploit -- if
>> indeed it had any way of knowing that the request was for a stack
>> temporary, rather than for a longer-lived (or less predictably-lived)
>> object.
> I agree a special stack allocator could have a place here (in
> particular, if for some reason running with a very small execution
> stack). And it could be done internally in gmp.
>
> But also note that small allocations are typically done on the execution
> stack, with alloca, and that for larger allocations, the computational
> cost of computing the limbs stored there is vastly larger than the
> allocation overhead, making the latter negligible.
>
>> One of the difficulties with GMP is that the limb arrays associated
>> with mpz_t (and therefore mpq_t) objects *never* decrease in
>> size, unless the user explicitly calls mpz_realloc2().
> In case requiring explicit calls to mpz_realloc is difficult for real
> applications with "intermediate expression swell", we should consider
> doing that automatically in gmp, e.g., whenever size < alloc / 3 or so.
> And as you note, a gc can also do this automatically if it is told which
> mpz_t objects exist.
>
> Regard,
> /Niels
>