GMP «Arithmetic without limitations» Plans for GMP 5.1



This is an attempt at defining a development target for GMP 5.1, to be released in 2012. [Last updated 2011-12-04]

Colour codes

[1] This really ought to be done before release

[2] Try to get this done before release

[n] Done!

[n] Leave for 5.2/6.0!


Bugs (in development sources)

Balanced multiplication

Unbalanced multiplication

Handle very unbalanced multiplication by "transforming" the smaller operand, then multiply using toomX2, toomX3, using the transformed value. Similarly in FFT range.

Multiply with FFT

Merge Niels' small-primes FFT code. Make sure it is memory efficient.

Merge new mul_fft.c. [Stalls on SFLC analysis.] Alternatively, improve the present code ourselves, or perhaps reimplement the algorithm from scratch.

Extend SS FFT table more intelligently, taking 'goodness' into account. (For both plain and fat builds.)

Finish mulmod_bnm1

Write assembly basecase mulmod_bnm1.

Short products

Write assembly mullo_basecase for x86-64

Merge David's mpn_mulmid code, use it at least for mpn_binvert.

Division m,n

Implement van der Hoeven's MU generalisation.

Perfect algorithm selection for nn-limb by dn-limb division.

Add pi/preinv variants for all mu functions. [2h]

We still use mpn_tdiv_qr or even mpn_divrem in many files. Replace by current division function:
mpn mpz mpf
generic/divrem.c jacobi.c get_str.c
generic/gcd.c powm.c set_q.c
generic/gcd_subdiv_step.c powm_ui.c set_str.c
generic/gcdext.c tdiv_qr.c ui_div.c
generic/get_str.c tdiv_r.c
generic/powm.c
generic/powm_sec.c

These files also use mpn_tdiv_qr, but just for nails:

mpz/cdiv_q_ui.c mpz/cdiv_qr_ui.c mpz/cdiv_r_ui.c mpz/cdiv_ui.c mpz/fdiv_q_ui.c mpz/fdiv_qr_ui.c mpz/fdiv_r_ui.c mpz/fdiv_ui.c mpz/tdiv_q_ui.c mpz/tdiv_qr_ui.c mpz/tdiv_r_ui.c mpz/tdiv_ui.c

Divide-by-fewlimb and modulo-by-fewlimb

[This section has become slightly obsolete. E.g., modexact_1_odd is considered deprecated.]

Implement mpn_invert_limb for x86_64 chips with fast 32-bit div, using an initial 32-bit approximation followed by a Newton step. This might win for Intel Core2, Corei and Atom, and VIA Nano. [No improvement seen in my tests--tege]

Add new set of bdiv_1, bdiv_2 and div_1, div_2 functions, with using 1-limb, two-limb inverses. We should provide two or three mod-by-fewlimb chains for when no invariance is present:

For the 1st and 3rd category, precomputation also makes sense. The right routine for that can be determined from the above data; if a THRESHOLD becomes MP_SIZE_T_MAX, then the penultimate routine should be used.

We need to define a plain 2/1-division based mod_1 loops (norm, unorm) as separate abstractions, since this is something all processors will need. This loop will in almost all cases be used just for very small n. Selection mechanisms between the mod functions could still be in C.

Commit lots of new assembly code for mod_1s_*.

Commit lots of new assembly code for div_1_*.

Commit lots of new assembly code for div_2_*.

mpz_remove, mpn_remove

Make mpz_remove use mpn_remove. Ready: just check for speed regressions.

Use mpn_remove from mpn/generic/perfpow.c.

mpn_broot

Implement b-adic root, r^k*y mod B^n, given k, y, n. A start is binv_root in mpn/generic/perfpow.c.

Exact division

We have an ugly mpn_divexact using Jebelean's bidirectional algorithm. Clean it up, and probably permit even divisors.

GCD

Use mulmod_bnm1 for cancelling operand updates. [Partly done]

Replace gcd_1 by gcd_11, and implement gcd_22. Implement gcd_1 in C, calling an abstract gcd-modulo function (see Divide-by-fewlimb and modulo-by-fewlimb above).

Factorial, binomial

Our n-factorial code is complex, ugly, and slow for large operands. Reimplement asymptotically optimally.

Our n-choose-k code is rudimentary. Reimplement using a primes sieve. Also write new basecase code. [Work-in-progress]

Intra-library references

Intra-library references, i.e., function calls and reads/writes of global variables, use various access structures like GOT and PLT. These are not needed, and at least in ELF we may suppress them using attributes such as "protected" and "hidden". (It seems something similar exists also on Darwin; here the compiler adds a .private_extern to the declaration of a function, but does not change calls [until supposedly in the final link].)

Fix calls to strictly internal routines.
Fix references to strictly internal variables (e.g., alloc func pointers).

Fix calls to remaining mpn routines (perhaps using hidden+alias).

Fix calls via gmpn_cpuvec (x86/fat/fat_entry.asm, x86_64/fat/fat_entry.asm). The table itself needs a GOT and each entry points to a PLT entry...

Configuru

Improve C++ configure, see comment in configure.in. In particular, enforce selected ABI also for C++ (through CXXFLAGS).

Add more fat routines, as well as more fat thresholds. A good starting point is "struct cpuvec_t" in gmp-impl.h.

Handle non-recognised processors in the fat mechanisms (x86/fat/fat.c, x86_64/fat/fat.c) more robustly. Unrecognised processors should be handled using cpuid's feature mechanism, plus some heuristics to match the best known CPU. (This will make sure we never try to execute non-existing instructions.)

Consider making fat the default. This could allow us to make more fine grained tuning, without trying to invent ever stranger configure CPU names.

Make GMP_CPU_TYPE fat CPU selection standard for a fat build (but perhaps rename it to something more specific, GMP_FAT_CPU_TYPE_SELECT). Motive: Testability.

Improve thresholds handling for fat builds, with the goal of making a fat lib the same speed as soon as operands are large enough.

Consider moving the exact CPU recognition from the configure triplet to some separate options (perhaps "--with-cpu").

Port x86_64 assembly to Windoze64.

C++ interface

TBD.

Test environment

Implement artificially small limbs, ASL™. Using C++ we can overload the mp_limb_t type, and let it be any size smaller-than 'long long'. This will allow for better testing. [Work-in-progress]

Implement a mechanism as part of the nightly builds for finding speed regressions.

Misc

Fix ppc970 and itanium2 slowdown for small powm.

Put new low level primitives in tune/speed.c, tests/devel/try.c. [RECURRENT]

Remove many divrem_1.{asm,S,s} since C code is faster already in GMP 4.3. (If the top-level x86 or x86_64 divrem_1.asm disappear, fat.c needs updating.)

Emmanuel Thomé's SUBDIR order.

Update minithres with any new THRESHOLDs. [RECURRENT]

Disable asm files using --disable-asm configure option.


Please send comments about this page to gmp-discuss@gmplib.org
Copyright 2009, 2010, 2011 Free Software Foundation
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.