Memory usage for large multiplications

Wed Feb 3 11:06:56 CET 2010

bodrato at mail.dm.unipi.it writes:

  Ciao,

  > FYI, Linux has the ability to overallocate memory, allocating it
  > really only when needed. In such a case, the former strategy is
  > better. But I wouldn't recommend it, as in case of lack of memory,
  > expect random processes to be killed by the OOM Killer. :(

  The OOM killer can kill a process even in the "just in time allocation"
  scenario, when two processes concurrently ask for memory.

  By the way, the kernel can delay real allocation, but it can not detect
  when memory is not needed any more and free it.
  Maybe the itch/scratch strategy should be limited to a few megabytes (or
  to the small-to-middle-sizes algorithms)? And bigger memory areas should
  be always allocated and freed on the fly?

We need to make a few things clear to ourselves:

* Allocating scratch in a caller speeds things up for small operations
  in various ways.  We should allocate as little as possible for
  spatial locality and thus better cache utilisation ad TLB
  utilisation.  But spending many cycles on minimising allocation is
  not a good stratagy here.

* Allocating scratch space in a caller for huge operations will not
  save any time.  But reusing scratch by passing already allocated
  scratch can save space.

* Saving memory for the largest operations is very important.  If we
  compare the code used for the recent Pi record and GMP using
  gmp-chudnovski.c, the record Pi code is many times more memory
  efficient.  One reason for GMP's memory apetite is that the
  scratch/itch interface is not implemented throughout, so e.g. the FFT
  code will not e able to make use of existing scratch memory that
  callers have available.

So the scratch/itch interface for large and small operations serve
different purposes, saving time and saving space.  We need to keep that
in mind.  Concretely: We should not be afraid to spend thousands of
cycles for conserving memory when we have operands of millions of limbs,
according to a principle of "relative overhead".

What should we do?

I think we should as a first step make large operations accept but not
require scratch space, while small operations should continue to require
it.  (Marco already did this for GMP 5.0 in at least one place.)

As a possible second step, we might want to introduce a more complex
scratch management system for large operations.  Often, a caller has
several unused pieces of memory, it would make sense to allow it to pass
them all.  ("Here are n bytes, and here are m bytes, use them as you
please.")

We will probably never reach the specialised Pi program's memory
efficiency, since we cannot glue GMP's routines as closely together as
is possible in such a program.  But I think we could get close.

-- 
Torbjörn