Itemised plans for GMP

This is an attempt at defining a development target for the next major GMP release. We might not implement every item here for that release, and we will surely make some developments missing from this list.

Colour codes

[1] This really ought to be done before release

[2] Try to get this done before release

[n] Done!

[n] Leave for subsequent releases!

Bugs (in development sources)

We miss lots of rules in libtool (and configure, elsewhere?) because they match "x86_64" and "i386" instead of all our CPUs. Among unknown things, this causes solaris shared builds to silently fall back to static builds (presumably since anno dazumal).

Balanced multiplication

Unbalanced multiplication

Handle very unbalanced multiplication by "transforming" the smaller operand, then multiply using toomX2, toomX3, using the transformed value. Similarly in FFT range.

Multiply with FFT

Merge Niels' small-primes FFT code. Make sure it is memory efficient.

Merge new mul_fft.c. [Stalls on SFLC analysis.] Alternatively, improve the present code ourselves, or perhaps reimplement the algorithm from scratch.

Extend SS FFT table more intelligently, taking 'goodness' into account. (For both plain and fat builds.)

Implement the √ 2 trick: 2^3n/4−2^n/4 is a square root of 2 mod (2ⁿ+1). This allows for smaller coefficients.

Short products

Merge David's mpn_mulmid code, use it at least for mpn_binvert.

Division m,n

Implement van der Hoeven's MU generalisation.

Perfect algorithm selection for nn-limb by dn-limb division.

Add pi/preinv variants for all mu functions. [2h]

We still use mpn_tdiv_qr or even mpn_divrem in many files. Replace by current division functions:

mpn mpz mpf

generic/divrem.c jacobi.c get_str.c

generic/gcd.c powm.c set_q.c

generic/gcd_subdiv_step.c set_str.c

generic/gcdext.c tdiv_qr.c ui_div.c

generic/get_str.c tdiv_r.c

generic/powm.c

mpn	mpz	mpf
generic/divrem.c	jacobi.c	get_str.c
generic/gcd.c	powm.c	set_q.c
generic/gcd_subdiv_step.c	set_str.c
generic/gcdext.c	tdiv_qr.c	ui_div.c
generic/get_str.c	tdiv_r.c
generic/powm.c

These files also use mpn_tdiv_qr, but just for nails:

mpz/cdiv_q_ui.c mpz/cdiv_qr_ui.c mpz/cdiv_r_ui.c mpz/cdiv_ui.c mpz/fdiv_q_ui.c mpz/fdiv_qr_ui.c mpz/fdiv_r_ui.c mpz/fdiv_ui.c mpz/tdiv_q_ui.c mpz/tdiv_qr_ui.c mpz/tdiv_r_ui.c mpz/tdiv_ui.c

Divide-by-fewlimb and modulo-by-fewlimb

Complete set of bdiv_1, bdiv_2 and div_1, div_2 functions, using 1-limb, 2-limb inverses.

Define plain 2/1-division based mod_1 loops (norm, unorm) as separate abstractions, since this is something all processors will need. This loop will in almost all cases be used just for very small n, since mod_1s-family functions will take over very quickly. Selection mechanisms between the mod functions could still be in C.

mpz_remove, mpn_remove

Call mpn_remove from mpn/generic/perfpow.c.

mpn_redc vs mpn_bdiv

Finish unification of bdiv/redc. The internal (sorry!) repo ~hg/gmp-proj/gmp-bdiv has a good start.

mpn_broot, mpn_binvroot, mpn_bsqrt, mpn_binvsqrt

Implement wrap-around trick for the -adic root code.

Exact powers

Add companion for mpz_perfect_power_p that returns the exponent and optionally the root. Possible name mpz_rootexact. Marco suggested some alternative functions, which could test a number and extract its nth power. (Sept/Oct 2012).

Exact division

We have an ugly mpn_divexact using Jebelean's bidirectional algorithm. Clean it up, and probably permit even divisors.

GCD

Use mulmod_bnm1 for cancelling operand updates. [Partly done]

Unlike most basic functions in GMP, GCD depends on C code and thus on compilers. Furthermore, extended GCD lacks proper basecase code, and instead invoke the heavy machinery.

Replace gcd_1 by gcd_11, accepting two limbs. Implement gcd_1 in C for compatibility.

Implement gcd_22 in C, also in assembly using compiler generation.

Implement gcdext_22.

Modexp, b^d mod m

See also the top-level developers' corner page for a deeper discussion on this subject.

Use special code for when b is much smaller than m, avoiding REDC representation and table precomputation. New pseudo-code at the top devel page.

Side-channel silent functions

Calls and linkage

Intra-library references, i.e., function calls and reads/writes of global variables, use various access structures like GOT and PLT. These are not needed, and at least in ELF we may suppress them using attributes such as "protected" and "hidden". (It seems something similar exists also on Darwin; here the compiler adds a .private_extern to the declaration of a function, but does not change calls [until supposedly in the final link].)

This can also be used to enforce the policy that certain functions are really internal to the library. Depending on how one sees such policies, such enforcing might be good or bad. We should probably use it with care. Calls that are declared with attribute "internal" will be the fastest, since then the GOT pointer can be assumed correct.

Fix calls to strictly internal routines, using the "internal" attribute, e.g. allowing compiler suppression of GOT setup code.
Fix references to strictly internal variables (e.g., alloc func pointers).

Fix calls to remaining mpn routines (perhaps using hidden+alias).

Fix calls via gmpn_cpuvec (x86/fat/fat_entry.asm, x86_64/fat/fat_entry.asm). The table itself needs a GOT and each entry points to a PLT entry...

Use restrict keyword in most mpn functions.

Use visibility hidden for all symbols used internally, then make documented symbols visible with alias. See star hacker Ian Lance Taylor's explanations.

What happens if we use __attribute__ visibility("hidden") on a platform which does not support symbol visibility? ("Nothing" is OK.)
Do we need to mark a symbol as hidden for each use? Clearly, for C/C++ the compiler needs to know about the hidden attribute in order to use the optimal relocs. But how about assembly files which merely use a symbol, does ".hidden foo" change anything wrt foo?
Fat binaries contain generated symbols, i.e., __gmpn_addlsh2_n_core2, __gmpn_mod_1_1p_x86_64. We need to generate corresponding "hidden" decls.

Configuru

Improve C++ configure, see comment in configure.in. In particular, enforce selected ABI also for C++ (through CXXFLAGS).

Add more fat routines, as well as more fat thresholds. Ideally, any function with exist in a top-level mpn/cpufam directory and is also provided specially for at least one (relevant) sub-cpu, should be fat.

Consider making fat the default. This could allow us to make more fine grained tuning, without trying to invent ever stranger configure CPU names.

Make GMP_CPU_TYPE fat CPU selection standard for a fat build (but perhaps rename it to something more specific, GMP_FAT_CPU_TYPE_SELECT). Motive: Testability.

Improve thresholds handling for fat builds, with the goal of making a fat lib as fast as a slim lib.

Consider moving the exact CPU recognition from the configure triplet to some separate options (perhaps "--with-cpu").

C++ interface

TBD.

Assembly code

By "common CPUs" we mainly mean AMD k10/bulldozer/piledriver, Intel SBR/IBR/HWL, POWER 7, ARM A9/A15.

Commit lots of new assembly code for div_1_*.

Commit lots of new assembly code for div_2_*.

We have lots of great multiply primitives like mul_1, mul_2, addmul_1, addmul_2, but development lags for O(n²) functions like mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2, mulmod_bnm1, sqrmod_bnm1. Perhaps we could define the primitive functions in such a way that we could automatically assemble reasonably good O(n²) functions?

Test environment

Implement artificially small limbs, ASL™. Using C++ overload mp_limb_t, and let it be any size smaller-than long long. This will allow for better testing. [Work-in-progress]

Implement a mechanism as part of the nightly builds for finding speed regressions.

Misc

Improve threshold cutoff points by modelling the run time of algorithm A and algorithm B according to their complexity. E.g., the sec modexp table sizes today are chosen from the exponent size only. That's poor since the ring size in reality influences things significantly.

Document how to use stdarg on mpz_t, mpq_t, mpf_t. This requires that we make mpz_ptr, etc, public. Example use in mpz/inits.c.

Put new low level primitives in tune/speed.c, tests/devel/try.c. [RECURRENT]

Remove many divrem_1.asm since C code is faster already in GMP 4.3. (If the top-level x86 or x86_64 divrem_1.asm disappear, fat.c needs updating.)

Emmanuel Thomé's SUBDIR order.

Update minithres with any new THRESHOLDs. [RECURRENT]