Arithmetic without limitations?

Wed Feb 10 21:00:27 CET 2010

Paul Zimmermann <Paul.Zimmermann at loria.fr> writes:

  yesterday Fabrice Bellard, who holds the current record for computing digits of
  Pi, gave a talk in our lab. His slides (in french) are available at
  http://misn.loria.fr/fichiers/bellard-20100209.pdf. Slide 34 compares his
  implementation (TPI) with gmp-chudnovsky.c and GMP 5.0.1. You can try TPI
  from http://bellard.org/pi/pi2700e9/tpi.html.

Oh, GMP is not that terribly much behind as I thought.  At least not for
measly little numbers such as 10^8.

  The most amazing thing is not that he set a new record, but that he did it on
  a personal computer with only 6GB of RAM, where the largest multiplication he
  had to perform was between two numbers of 2700 billion decimal digits, i.e.,
  about 1.1TB each!

Incidentally, I have long planned to do exactly what he did, once we had
the Cooley-Tukey small primes code in place.  My goal was to make a
spalsh for GMP 5, but in the end we decided to postpone the new FFT
code.  Yes, I am little envious not to have done this first.  :-)

  This is really what we are missing in GMP. The current GMP algorithms are so
  fast (almost linear in the input size) that time is no more a problem, but more
  and more memory is a real limitation.

  To set his record, Fabrice designed a multiprecision library (yes, from
  scratch) with the same API as GMP, but with an intermediate level "mpt"
  between "mpn" and "mpz". The idea is that a "mpt" can either contain an
  array of limbs (mpn) in memory, or points to a file on disk. The mpn level
  (now mpt) has to be redesigned to be able to handle inputs or outputs on disk,
  but then every program using mpt (including mpz) will work without any change.

  I believe in the future GMP should give facilities for "out-of-core" computing.

My idea for GMP has long been to make "hierarchical locality" take care
of it all.  A row in in the k-dimensional matrix would fit into L1
cache, a plane would fit into memory, further dimensions would live in
swap space (not exlicit files).

I.e., I'd like to try relying on the operating system algorithms for
paging, together will well organised algortihms.

I am sure that we will never get exactly as fast as any program with
full overview of the operations.  We wll always be slightly more memory
greedy, and might not be able to prefetch from disk as easily.  But if
we're really crazy, we could use a prefetch thread for helping th FFT
code; that thread would bring in pages of the data to be used next.

My goal is to reach 99% L1 cache hit for a multiply of something huge
like 2^40 bits and no CPU stalling on page loads...

-- 
Torbjörn