Itemised plans for GMP
Last modified: 2016-12-17
This is an attempt at defining a development target for the next major GMP release. We might not implement every item here for that release, and we will surely make some developments missing from this list.
 This really ought to be done before release
 Try to get this done before release
[n] Leave for subsequent releases!
We miss lots of rules in libtool (and configure, elsewhere?) because they match "x86_64" and "i386" instead of all our CPUs. Among unknown things, this causes solaris shared builds to silently fall back to static builds (presumably since anno dazumal).
Handle very unbalanced multiplication by "transforming" the smaller operand, then multiply using toomX2, toomX3, using the transformed value. Similarly in FFT range.
Merge Niels' small-primes FFT code. Make sure it is memory efficient.
Merge new mul_fft.c. [Stalls on SFLC analysis.] Alternatively, improve the present code ourselves, or perhaps reimplement the algorithm from scratch.
Extend SS FFT table more intelligently, taking 'goodness' into account. (For both plain and fat builds.)
Implement the √ 2 trick: 23n/4−2n/4 is a square root of 2 mod (2n+1). This allows for smaller coefficients.
Merge David's mpn_mulmid code, use it at least for mpn_binvert.
Implement van der Hoeven's MU generalisation.
Perfect algorithm selection for nn-limb by dn-limb division.
Add pi/preinv variants for all mu functions. [2h]
We still use mpn_tdiv_qr or even mpn_divrem in many files. Replace by current division functions:
mpn mpz mpf generic/divrem.c jacobi.c get_str.c generic/gcd.c powm.c set_q.c generic/gcd_subdiv_step.c set_str.c generic/gcdext.c tdiv_qr.c ui_div.c generic/get_str.c tdiv_r.c generic/powm.c
These files also use mpn_tdiv_qr, but just for nails:
mpz/cdiv_q_ui.c mpz/cdiv_qr_ui.c mpz/cdiv_r_ui.c mpz/cdiv_ui.c mpz/fdiv_q_ui.c mpz/fdiv_qr_ui.c mpz/fdiv_r_ui.c mpz/fdiv_ui.c mpz/tdiv_q_ui.c mpz/tdiv_qr_ui.c mpz/tdiv_r_ui.c mpz/tdiv_ui.c
Complete set of bdiv_1, bdiv_2 and div_1, div_2 functions, using 1-limb, 2-limb inverses.
Define plain 2/1-division based mod_1 loops (norm, unorm) as separate abstractions, since this is something all processors will need. This loop will in almost all cases be used just for very small n, since mod_1s-family functions will take over very quickly. Selection mechanisms between the mod functions could still be in C.
Call mpn_remove from mpn/generic/perfpow.c.
Finish unification of bdiv/redc. The internal (sorry!) repo ~hg/gmp-proj/gmp-bdiv has a good start.
Implement wrap-around trick for the -adic root code.
Add companion for mpz_perfect_power_p that returns the exponent and optionally the root. Possible name mpz_rootexact. Marco suggested some alternative functions, which could test a number and extract its nth power. (Sept/Oct 2012).
We have an ugly mpn_divexact using Jebelean's bidirectional algorithm. Clean it up, and probably permit even divisors.
Use mulmod_bnm1 for cancelling operand updates. [Partly done]
Unlike most basic functions in GMP, GCD depends on C code and thus on compilers. Furthermore, extended GCD lacks proper basecase code, and instead invoke the heavy machinery.
Replace gcd_1 by gcd_11, accepting two limbs. Implement gcd_1 in C for compatibility.
Implement gcd_22 in C, also in assembly using compiler generation.
See also the top-level developers' corner page for a deeper discussion on this subject.
Use special code for when b is much smaller than m, avoiding REDC representation and table precomputation. New pseudo-code at the top devel page.
Intra-library references, i.e., function calls and reads/writes of global variables, use various access structures like GOT and PLT. These are not needed, and at least in ELF we may suppress them using attributes such as "protected" and "hidden". (It seems something similar exists also on Darwin; here the compiler adds a .private_extern to the declaration of a function, but does not change calls [until supposedly in the final link].)
This can also be used to enforce the policy that certain functions are really internal to the library. Depending on how one sees such policies, such enforcing might be good or bad. We should probably use it with care. Calls that are declared with attribute "internal" will be the fastest, since then the GOT pointer can be assumed correct.
Fix calls to strictly internal routines, using the
"internal" attribute, e.g. allowing compiler suppression of GOT setup
Fix references to strictly internal variables (e.g., alloc func pointers).
Fix calls to remaining mpn routines (perhaps using hidden+alias).
Fix calls via gmpn_cpuvec (x86/fat/fat_entry.asm, x86_64/fat/fat_entry.asm). The table itself needs a GOT and each entry points to a PLT entry...
restrict keyword in
visibility hidden for all
symbols used internally, then make documented symbols visible with alias.
See star hacker Ian Lance Taylor's explanations.
- What happens if we use __attribute__ visibility("hidden") on a platform which does not support symbol visibility? ("Nothing" is OK.)
- Do we need to mark a symbol as hidden for each use? Clearly, for C/C++ the compiler needs to know about the hidden attribute in order to use the optimal relocs. But how about assembly files which merely use a symbol, does ".hidden foo" change anything wrt foo?
- Fat binaries contain generated symbols, i.e., __gmpn_addlsh2_n_core2, __gmpn_mod_1_1p_x86_64. We need to generate corresponding "hidden" decls.
Improve C++ configure, see comment in configure.in. In particular, enforce selected ABI also for C++ (through CXXFLAGS).
Add more fat routines, as well as more fat
thresholds. Ideally, any function with exist in a
mpn/cpufam directory and is also provided specially
for at least one (relevant) sub-cpu, should be fat.
Consider making fat the default. This could allow us to make more fine grained tuning, without trying to invent ever stranger configure CPU names.
Make GMP_CPU_TYPE fat CPU selection standard for a fat build (but perhaps rename it to something more specific, GMP_FAT_CPU_TYPE_SELECT). Motive: Testability.
Improve thresholds handling for fat builds, with the goal of making a fat lib as fast as a slim lib.
Consider moving the exact CPU recognition from the configure triplet to some separate options (perhaps "--with-cpu").
By "common CPUs" we mainly mean AMD k10/bulldozer/piledriver, Intel SBR/IBR/HWL, POWER 7, ARM A9/A15.
Commit lots of new assembly code for div_1_*.
Commit lots of new assembly code for div_2_*.
We have lots of great multiply primitives like mul_1, mul_2, addmul_1, addmul_2, but development lags for O(n2) functions like mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2, mulmod_bnm1, sqrmod_bnm1. Perhaps we could define the primitive functions in such a way that we could automatically assemble reasonably good O(n2) functions?
Implement artificially small limbs, ASL™.
Using C++ overload
mp_limb_t, and let it be any size
long long. This will allow for better testing.
Implement a mechanism as part of the nightly builds for finding speed regressions.
Improve threshold cutoff points by modelling the run time of algorithm A and algorithm B according to their complexity. E.g., the sec modexp table sizes today are chosen from the exponent size only. That's poor since the ring size in reality influences things significantly.
Document how to use stdarg on mpz_t, mpq_t, mpf_t. This requires that we make mpz_ptr, etc, public. Example use in mpz/inits.c.
Put new low level primitives in tune/speed.c, tests/devel/try.c. [RECURRENT]
Remove many divrem_1.asm since C code is faster already in GMP 4.3. (If the top-level x86 or x86_64 divrem_1.asm disappear, fat.c needs updating.)
Emmanuel Thomé's SUBDIR order.
Update minithres with any new THRESHOLDs. [RECURRENT]