FAT GMP 5 binaries

Torbjorn Granlund tege@swox.com
14 May 2003 14:53:44 +0200


For GMP 5, I'd like to evaluate the possibility of making fat
binaries with run-time choice of optimal routines.

Given an ABI, all (or an explicitly chosen subset of) the
implementation-tailored functions for that ABI should be
compiled.  For each routine subject to such code duplication, an
entry in a vector should be filled in.

For example, the function mpn_addmul_1 would tail-call
*(__gmp_cpuvec[RNUM_GMP_ADDMUL_1]).  If that entry has not been
called before, we hit a setup routine, which checks what CPU
we're running on and writes the appropriate address to
__gmp_cpuvec[RNUM_GMP_ADDMUL_1].  (No worries about reentrancy;
we should write pointers atomically.)
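
A minimal sketch of this lazy setup, under loud assumptions: the
names fat_cpuvec, fat_addmul_1 and setup_calls are hypothetical,
limb_t is a 32-bit stand-in for the real limb type, and the setup
routine always picks a portable C loop instead of doing real CPU
detection.  It uses a plain pointer store and is single-threaded;
it only illustrates the control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* stand-in limb type, not GMP's mp_limb_t */
typedef limb_t (*addmul_1_fn)(limb_t *rp, const limb_t *up,
                              size_t n, limb_t v);

enum { RNUM_GMP_ADDMUL_1, RNUM_COUNT };

static limb_t addmul_1_init(limb_t *, const limb_t *, size_t, limb_t);

/* Every slot starts out pointing at its setup routine. */
static addmul_1_fn fat_cpuvec[RNUM_COUNT] = { addmul_1_init };

/* Counts how often setup runs; here only for illustration. */
static int setup_calls;

/* Portable fallback: rp[] += up[] * v, returning the carry-out limb. */
static limb_t addmul_1_generic(limb_t *rp, const limb_t *up,
                               size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* First call lands here: fill in the slot, then forward the call. */
static limb_t addmul_1_init(limb_t *rp, const limb_t *up,
                            size_t n, limb_t v)
{
    setup_calls++;
    /* Assumption: real code would identify the CPU at this point. */
    fat_cpuvec[RNUM_GMP_ADDMUL_1] = addmul_1_generic;
    return fat_cpuvec[RNUM_GMP_ADDMUL_1](rp, up, n, v);
}

/* Public entry point: just an indirect call through the vector. */
limb_t fat_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    return fat_cpuvec[RNUM_GMP_ADDMUL_1](rp, up, n, v);
}
```

After the first call the slot holds the chosen routine, so later
calls pay only for the indirect jump.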

We should recognize most known processors, and choose our
specially tailored routines for those.  Such routines would be
called mpn_addmul_1_athlon, mpn_addmul_1_k6,
mpn_addmul_1_pentium4, etc.

For other processors, such as future ones that we cannot know
anything about, we should check the cpuid capability bit vector.
Does the processor support SSE2?  Then use mpn_addmul_1_sse2 (or
perhaps in this case mpn_addmul_1_pentium4).  Otherwise, use the
generic mpn_addmul_1_x86.
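
The selection order above (known model first, then capability
bits, then the generic code) might look roughly like this.  The
struct cpu_info, enum cpu_model and choose_addmul_1 are
hypothetical, and the function returns routine names as strings
rather than addresses so that the sketch stays portable.  The one
hard fact used is that SSE2 is reported in bit 26 of EDX for CPUID
leaf 1.

```c
#include <assert.h>
#include <string.h>

#define CPUID_EDX_SSE2 (1u << 26)   /* CPUID leaf 1, EDX bit 26 */

/* Hypothetical result of decoding the vendor/family/model fields. */
enum cpu_model { CPU_UNKNOWN, CPU_K6, CPU_ATHLON, CPU_PENTIUM4 };

struct cpu_info {
    enum cpu_model model;      /* CPU_UNKNOWN for processors we don't know */
    unsigned edx_features;     /* raw EDX capability bits from CPUID leaf 1 */
};

/* Return the name of the routine the setup code would install. */
const char *choose_addmul_1(const struct cpu_info *cpu)
{
    /* 1. Processors we recognize get their specially tailored code. */
    switch (cpu->model) {
    case CPU_K6:       return "mpn_addmul_1_k6";
    case CPU_ATHLON:   return "mpn_addmul_1_athlon";
    case CPU_PENTIUM4: return "mpn_addmul_1_pentium4";
    default:           break;
    }

    /* 2. Unknown processor: fall back on the capability bits. */
    if (cpu->edx_features & CPUID_EDX_SSE2)
        return "mpn_addmul_1_sse2";

    /* 3. Nothing useful advertised: generic x86 code. */
    return "mpn_addmul_1_x86";
}
```

Checking the model before the feature bits means a known CPU
keeps its tailored routine even when it also advertises SSE2.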

Internal GMP calls would use __gmp_cpuvec[RNUM_GMP_ADDMUL_1]
directly, via some gmp-impl.h macros.  The overhead would
be a few cycles.

We need to decide what routines to put in __gmp_cpuvec.  All GMP
routines is one choice.  All mpn routines, or a subset of mpn
routines, are other choices.

With some special mechanism, we could actually allow run-time
selection of vectors from timing tests, usable for processors
that we didn't know about at the time of a release.  That should
be right up your alley.  :-)
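
One way such timing-based selection could work, as a rough sketch
only: run every candidate on the same operands, measure it, and
keep the cheapest.  All names here (pick_fastest, time_candidate,
the two candidate loops) are made up for illustration, limb_t is a
32-bit stand-in, and clock() is far coarser than a real tuning
pass would want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t limb_t;   /* stand-in limb type, not GMP's mp_limb_t */
typedef limb_t (*addmul_1_fn)(limb_t *, const limb_t *, size_t, limb_t);

/* Candidate 1: straightforward loop, rp[] += up[] * v. */
static limb_t addmul_1_simple(limb_t *rp, const limb_t *up,
                              size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Candidate 2: same result, two limbs per iteration. */
static limb_t addmul_1_unrolled(limb_t *rp, const limb_t *up,
                                size_t n, limb_t v)
{
    limb_t carry = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t t0 = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t0;
        uint64_t t1 = (uint64_t)up[i + 1] * v + rp[i + 1]
                      + (limb_t)(t0 >> 32);
        rp[i + 1] = (limb_t)t1;
        carry = (limb_t)(t1 >> 32);
    }
    if (i < n) {   /* odd limb count: handle the last limb */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Measure one candidate in clock() ticks on fixed test operands. */
static double time_candidate(addmul_1_fn f)
{
    enum { N = 256, REPS = 1000 };
    static limb_t r[N], u[N];
    volatile limb_t sink = 0;   /* keeps the calls from being elided */
    for (size_t i = 0; i < N; i++)
        u[i] = (limb_t)(0x9E3779B9u * (uint32_t)(i + 1));
    clock_t start = clock();
    for (int rep = 0; rep < REPS; rep++)
        sink += f(r, u, N, 12345u);
    return (double)(clock() - start);
}

/* Return the candidate with the lowest measured cost; ties keep the
   earlier entry.  The caller would store the result in the vector. */
addmul_1_fn pick_fastest(const addmul_1_fn *cand, size_t n,
                         double (*cost)(addmul_1_fn))
{
    addmul_1_fn best = cand[0];
    double best_cost = cost(best);
    for (size_t i = 1; i < n; i++) {
        double c = cost(cand[i]);
        if (c < best_cost) {
            best = cand[i];
            best_cost = c;
        }
    }
    return best;
}
```

Passing the cost function in as a parameter keeps the selection
logic separate from the measurement method, so a cycle counter
could replace clock() without touching pick_fastest.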

The current situation is getting out of hand, and people aren't
getting the performance they could.  While fat binaries aren't
ideal either, they are better than the generic x86 code that many
people end up using today.

I'd like to use the same trick for other processor families.

--
Torbjörn