| GMP Development Tidbits or GMP(tm) Technology(tm) Preview(tm) |
Here the GMP developers present some new code that will eventually go into a GMP release. Please use with caution, as this code might not always be well-tested, and in some cases might not even compile. (But if it doesn't compile, we probably forgot to include some needed pieces; please let us know if that happens.)
None of the interfaces or conventions used by these functions can be safely assumed to be present in any future GMP release.
You are encouraged to study this code and help us improve it. Such improvements can be speedups, or code simplifications.
Note: This code is not simple drop-in replacements for any current code, and Makefiles will not magically know how to build this code, nor will current GMP functions magically know how to call these new functions.
NB: Most of the code here is released under the GPL, not the LGPL. This means that it cannot be used at all in non-free programs.
This is Niels Möller's sub-quadratic GCD code, implementing both plain gcd and extended gcd. It is believed to be quite stable at this point.
Comparison of gcd routines. The top row denotes number of bits in the input arguments:
| Times in µs | 100 | 1000 | 10000 | 100,000 | 1,000,000 | 10,000,000 | 100,000,000 | 1,000,000,000 |
|---|---|---|---|---|---|---|---|---|
| mpn_gcd (new) | 0.672 | 18.6 | 279 | 8,370 | 251,000 | 5,940,000 | 112,000,000 | 1,890,000,000 |
| mpn_gcdext (new) | 1.26 | 18.6 | 341 | 11,100 | 389,000 | 10,200,000 | 202,000,000 | 3,390,000,000 |
| mpn_gcd (4.2) | 0.656 | 18.8 | 292 | 12,800 | 1,930,000 | 360,000,000 | ||
| mpn_gcdext (4.2) | 2.70 | 39.0 | 556 | 24,400 | 3,220,000 | 767,000,000 |
| Download files | Last changed |
|---|---|
| mpn/generic/gcd.c | 2005-06-13 |
| mpn/generic/gcdext.c | 2005-06-13 |
| mpn/generic/hgcd.c | 2005-03-31 |
| mpn/generic/hgcd2.c | 2004-01-25 |
| mpn/generic/qstack.c | 2003-12-19 |
| gmp-impl.h-gcd-newlines | 2005-11-17 |
This code greatly improves multiplication performance for when the two operands have different sizes, also known as unbalanced operands.
The code is believed to be stable and should be straightforward to deploy.
| Download files | Last changed | What changed? |
|---|---|---|
| mpn/generic/mul.c | 2006-12-29 | |
| mpn/generic/mul_toom22.c | 2007-01-11 | test code bug fix |
| mpn/generic/mul_toom32.c | 2007-01-11 | test code |
| mpn/generic/mul_toom42.c | 2007-01-13 | interpolate call |
| mpn/generic/toom_interpolate_5pts.c | 2007-01-13 | new |
We are working on more Toom functions, see below. In addition,
mpn_mul_toomM3 for M = 4, 6 are planned, using 6 and
8 evaluation points, respectively. mpn_mul_toom33 is like the
present GMP 4.2 mpn_mul_toom3 except that the former allows the
operands to be slightly unbalanced.
We believe that even more Toom functions might be needed for optimal
performance, perhaps going as high as mpn_mul_toomM7. This will
largely depend on ongoing work on FFT multiplication; the faster that will
become, the less room for higher-order Toom. If we need to go to order-7,
there will be a forbidding number of functions, in order to support unbalanced
operands as well as we support balanced operands.
The following functions are more bleeding edge, and will be integrated with
mul.c at some point. The code is under active development,
and is therefore not unlikely to have bugs.
| Download files | Last changed | What changed? |
|---|---|---|
| mpn/generic/mul_toom33.c | 2007-01-14 | new |
| mpn/generic/toom_interpolate_5pts.c | 2007-01-13 | new |
| mpn/generic/mul_toom44.c | 2007-01-11 | new |
| mpn/generic/mul_toom53.c | 2007-01-14 | optimized |
| mpn/generic/mul_toom62.c | 2007-01-11 | new |
| mpn/generic/toom_interpolate_7pts.c | 2007-01-11 | new |
This new tool is useful for for optimizing the evaluation sequences of Toom functions. It exhaustively searches for possible sequences, given a goal function. It is similar to the GNU superoptimizer, except that this program optimizes over a a simple algebraic structure and handles m-ary -> n-ary projections (the GNU superoptimizer optimizes over an assembly language subset, and handles m-ary -> 1-ary projections). This is very immature code, but it seems to work.
| Download files | Last changed | What changed? |
|---|---|---|
| superopt-va.c | 2007-01-16/2 | speedup+UI improvement |
| goal-va.def | 2007-01-16 | bug fixes |
This is new code for truncating division and Hensel division at the mpn level. There is also new exact division code, exact meaning that it is known that N mod D = 0. The new code has complexity O(M(n)) for all operations, while the old mpz code is O(M(n)log(n). More precisely, this code takes time 2M(n) for large n.
This code was written by Niels Möller and Torbjörn Granlund.
Notation: nn, dn, qn are the limb count of N, D, and Q, respectively. M(n) is the time use for multiplication of two n-bit numbers.
For small operands, simple O(n²) "schoolbook" code is used, for medium operands a divide-and-conquer algorithm is used, and for large operands a variant of Barrett's algorithm is used. The files names contain "sb", "dc", and "mu" for the respective algorithm.
We make use of that large operand multiplication in GMP uses FFT which computes multiplication mod 2^m+1, for some value of m. In the inversion Fermat iteration code, we know that the low part is ...00001. In the quotient development code, we know that the low or upper part cancels with the partial remainder. We can therefore choose m smaller than our desired products, and let the upper products "wrap" into the low part of the product.
Our algorithm is similar to Barrett's algorithm, except that we compute an inverse which is typically smaller than D; the inverse has sometimes just 1/2 or 1/3 as many digits as D.
Further work in these areas is desirable:
The interface of this code might change multiple times before becoming part of GMP, and even then the interface will be undocumented/internal. The code is under active development, and is therefore not unlikely to have bugs.
| Download files | Last changed | What changed? |
|---|---|---|
| new.h | 2006-11-07 | |
| mpn/generic/sb_div_qr.c | 2007-01-03 | |
| mpn/generic/sb_divappr_q.c | 2007-01-03 | |
| mpn/generic/sb_div_q.c | 2007-01-03 | |
| mpn/generic/dc_div_qr.c | 2007-01-03 | |
| mpn/generic/dc_divappr_q.c | 2007-01-07 | new |
| mpn/generic/dc_div_q.c | 2007-01-07 | new |
| mpn/generic/invert.c | 2006-12-17 | file retracted by author |
| mpn/generic/mu_div_qr.c | 2006-11-07 | |
| mpn/generic/mu_divappr_q.c | 2006-11-10 | |
| mpn/generic/mu_div_q.c | 2007-01-07 | improved |
| mpn/generic/sb_bdiv_qr.c | 2006-11-07 | |
| mpn/generic/sb_bdiv_q.c | 2006-11-07 | |
| mpn/generic/dc_bdiv_qr.c | 2007-01-03 | |
| mpn/generic/dc_bdiv_q.c | 2006-12-31 | |
| mpn/generic/binvert.c | 2006-11-07 | |
| mpn/generic/mu_bdiv_q.c | 2006-11-07 | |
| mpn/generic/divexact.c | 2006-11-07 | |
| test-bdivisions.c | 2006-11-07 | |
| test-binvert.c | 2006-11-07 | |
| test-divexact.c | 2006-11-07 | |
| test-divisions.c | 2006-11-07 | |
| test-invert.c | 2006-11-07 |
The following tables compares the speed of division functions. The first table uses dividends that are precisely twice as large as the divisors; the second table uses dividends that are 11 times as large as the divisors. Times are in microseconds; operand sizes are in bits.
| dividend divisor | 200 100 | 2000 1000 | 20000 10000 | 200,000 100,000 | 2,000,000 1,000,000 | 20,000,000 10,000,000 | 200,000,000 100,000,000 | 2,000,000,000 1,000,000,000 |
|---|---|---|---|---|---|---|---|---|
| mpn_divexact (new) | 0.058 | 0.531 | 16.4 | 997 | 33,100 | 503,000 | 8,310,000 | 103,000,000 |
| mpz_divexact (4.2) | 0.083 | 0.64 | 28.9 | 2,140 | 67,200 | 1,410,000 | 25,900,000 | 413,000,000 |
| mpn_tdiv_qr (4.2) | 0.106 | 1.19 | 52.4 | 2,140 | 67,200 | 1,410,000 | 25,900,000 | 413,000,000 |
| dividend divisor | 1100 100 | 11000 1000 | 110000 10000 | 1,100,000 100,000 | 11,000,000 1,000,000 | 110,000,000 10,000,000 | 1,100,000,000 100,000,000 | 11,000,000,000 1,000,000,000 |
| mpn_divexact (new) | 0.247 | 6.66 | 452 | 20,000 | 316,000 | 4,850,000 | 64,000,000 | |
| mpz_divexact (4.2) | 0.281 | 7.72 | 517 | 21,200 | 658,000 | 14,100,000 | 259,000,000 | |
| mpn_tdiv_qr (4.2) | 0.351 | 10.2 | 559 | 21,200 | 658,000 | 14,100,000 | 259,000,000 |
| dividend divisor | 200 100 | 2000 1000 | 20000 10000 | 200,000 100,000 | 2,000,000 1,000,000 | 20,000,000 10,000,000 | 200,000,000 100,000,000 | 2,000,000,000 1,000,000,000 |
|---|---|---|---|---|---|---|---|---|
| mpn_sb_div_qr | 0.0597 | 0.983 | 56.5 | 5,250 | 871,000 | |||
| mpn_dc_div_qr | 50.4 | 2,120 | 67,800 | 1,430,000 | 25,700,000 | |||
| mpn_mu_div_qr | 70.7 | 2,440 | 48,100 | 736,000 | 10,700,000 | 133,000,000 | ||
| mpn_tdiv_qr (4.2) | 0.0905 | 1.19 | 51.5 | 2,120 | 67,200 | 1,400,000 | 25,900,000 | 413,000,000 |
| dividend divisor | 1100 100 | 11000 1000 | 110000 10000 | 1,100,000 100,000 | 11,000,000 1,000,000 | 110,000,000 10,000,000 | 1,100,000,000 100,000,000 | 11,000,000,000 1,000,000,000 |
| mpn_sb_div_qr | 0.371 | 9.14 | 561 | 53,000 | ||||
| mpn_dc_div_qr | 510 | 21,100 | 674,000 | 14,200,000 | 258,000,000 | |||
| mpn_mu_div_qr | 625 | 17,300 | 299,000 | 4,400,000 | 64,500,000 | |||
| mpn_tdiv_qr (4.2) | 0.380 | 10.4 | 517 | 21,200 | 668,000 | 14,200,000 | 259,000,000 |
Both mpn_set_str and mpn_get_str are being
rewritten, mainly for improving the memory economy and the performance, but
also for making performance and memory use more linear in operand size. The
new code is 20% to 30% faster than the old code for typical input, and it uses
less than half the amount of memory.
The code in present GMP releases computes a series of powers of the conversion base B,
1 2 4 (2^k)
B , B , B , ... B
and then multiplies (mpn_set_str) or divides
(mpn_get_str) by these powers.
The new code computes a better tuned series, where the topmost power is in the order of the square root of the converted number.
The update of 2007-10-28 implements a trimmed table of powers, no longer with a large number of zero bits at the low end, greatly improving speed for even bases (such as 10).
When the division code (see above) is completed, mpn_get_str
will become asymptotically faster. We will also at some point improve small
operands code by rewriting the base case code to convert the numbers into
fractions, and use
mpn_mul_1 for developing digits.
| Download files | Last changed | What changed? |
|---|---|---|
| mpn/generic/set_str.c | 2007-10-28 | optimization, alloc bug fix |
| mpn/generic/get_str.c | 2007-10-28 | optimization |
In order to use the download files, you need to put these lines in gmp-impl.h:
/* Definitions for mpn_set_str and mpn_get_str */
struct powers
{
mp_ptr p; /* actual power value */
mp_size_t n; /* # of limbs at p */
mp_size_t shift; /* weight of lowest limb, in limb base B */
size_t digits_in_base; /* number of corresponding digits */
int base;
};
typedef struct powers powers_t;
#define mpn_dc_set_str_powtab_alloc(n) ((n) + GMP_LIMB_BITS)
#define mpn_dc_set_str_itch(n) (n)
#define mpn_dc_get_str_powtab_alloc(n) ((n) + 2 * GMP_LIMB_BITS)
#define mpn_dc_get_str_itch(n) ((n) + GMP_LIMB_BITS)
#ifndef GET_STR_DC_THRESHOLD
#define GET_STR_DC_THRESHOLD 18
#endif
#ifndef GET_STR_PRECOMPUTE_THRESHOLD
#define GET_STR_PRECOMPUTE_THRESHOLD 35
#endif
#ifndef SET_STR_DC_THRESHOLD
#define SET_STR_DC_THRESHOLD 750
#endif
#ifndef SET_STR_PRECOMPUTE_THRESHOLD
#define SET_STR_PRECOMPUTE_THRESHOLD 2000
#endif
This is the current development version of the
file mpn/generic/mul_fft.c. The most important improvement here
is the choice of the k parameter. With the right tables in
mpn/MACHINE/gmp-mparam.h, this can smooth out the staircase effect.
There are also other changes, that slightly improves locality, and other
changes that decreases some overhead.
| Download files | Last changed | What changed? |
|---|---|---|
| mpn/generic/mul_fft.c | 2007-09-27 | speedup |
Here we'll link to some developments by external groups or people. These developments have not been tested by the GMP team, therefore we cannot make any assessments about their quality.