Rethinking GMP's configure
tg at gmplib.org
Mon Dec 5 23:20:07 CET 2011
With each new cpu, we invent a new name for it, since vendor names are
only feebly related to the actual core type. Our names describe the
actual core type, or microarchitecture.
We use this name in our configure strings, such as coreisbr-pc-linux-gnu
or k10-pc-freebsd. This is a slight abuse of the configure system, as
we communicate not ISA but microarchitecture. For cross compiles, the
names will be useless for finding cross tools, as has been pointed out
on the GMP lists.
The speed of, e.g., a coreisbr binary when run on, e.g., a k10 cpu will
be far from optimal. Do users know this? Probably not. How about
binary packages that link with GMP?
We have fat binaries for x86/32 and x86/64, available under the
configure option --enable-fat. It currently works OK, but it is quite
rudimentary in its selection of code, and its threshold values are
unconditionally based on nonfat builds.
I'd like to move away from a specifically configured lib to a
default-is-fat model. This is how it could work for a user:
Build GMP for the k10 and coreisbr CPUs:
configure --with-cpu=k10,coreisbr && make && make check && make install
Build GMP for just one CPU:
configure --with-cpu=bobcat && make && make check && make install
Build GMP for all CPUs (which adhere to the config.guess ISA/ABI):
configure --with-cpu=ALL && make && make check && make install
Build GMP for the host CPU:
configure --with-cpu=HOST && make && make check && make install
Note that I am not suggesting any change to ABI selection. I think we
should still go for the fastest ABI by default. (Implementing multilibs
is a different project altogether.)
How do we implement this? What about the overhead?
First, let's admit there will be some overhead. At some level, there
will be a need to jump through a table, initialised for the run-time CPU.
This indirection costs a few cycles each time. But note that in a
shared library, we typically access data through a GOT (global offset
table) and call functions indirectly through a PLT (procedure linkage
table). I think we could stay within the overhead of shared library
calls, if we do things right. (For ELF, that is, where we can control
how symbols are resolved.)
To decrease overhead, perhaps mainly in the static library, code
selection can be made not at the calls to the most primitive functions,
but a bit "higher"; we need to compile such functions several times with
fixed primitive functions (i.e., with calls to these going directly, not
through a jump table).
There are two main performance problems with the current fat mechanism:
(1) We only include a basic set of assembly primitives, and never
functions classified as optional. The optional functions have become
important to, e.g., multiply speed. (See any toom function to see why.)
(2) We don't do proper algorithm selection; we just import 5 thresholds
from each CPU. But these thresholds are based on measurements with *all*
assembly primitives included. We don't handle FFT thresholds at all...
To fix this, I suggest:
1. Allow for inclusion of all assembly primitives, those categorised as
optional as well as those always provided (at least as "generic" C
code).
2. When using an optional assembly primitive, use a run-time test if the
optional primitive is provided by *any* but not all configured CPUs,
exclude it if it is never available, and include it without a
run-time test if it is always available.
For mid-level functions, instead of using a run-time test as per
above, compile the file multiple times. (Actually, perhaps this might be
over-complex; we could compile all such functions for each CPU, to
allow for CPU-specific compiler options.)
3. Handle accurate algorithm selection, by allowing all threshold
values to be selected for the run-time CPU.
4. Perhaps make fat binaries work for more CPUs, such as powerpc, arm.
Any takers? This might be a suitable 3-6 month student project for a
very good student.