[Gmp-commit] /var/hg/gmp: 3 new changesets
mercurial at gmplib.org
mercurial at gmplib.org
Mon Dec 5 21:53:14 CET 2011
details: /var/hg/gmp/rev/ebb24f6a78a8
changeset: 14524:ebb24f6a78a8
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:43:52 2011 +0100
description:
Remove lots of cmpleted projects, add a few new ones.
details: /var/hg/gmp/rev/e4063cfbff46
changeset: 14525:e4063cfbff46
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:44:21 2011 +0100
description:
Clean up some tasks, file is still severely outdated.
details: /var/hg/gmp/rev/9c3b5c259370
changeset: 14526:9c3b5c259370
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:53:11 2011 +0100
description:
Update fat function set to be more relevant.
diffstat:
ChangeLog | 20 ++++
configure.in | 26 ++++-
doc/projects.html | 203 ++++++++-------------------------------------
doc/tasks.html | 9 +-
gmp-impl.h | 59 ++++++++++++-
mpn/x86/fat/diveby3.c | 21 ----
mpn/x86/fat/fat.c | 10 ++-
mpn/x86/fat/lshiftc.c | 21 ++++
mpn/x86/fat/mod_1_1.c | 25 +++++
mpn/x86/fat/mod_1_2.c | 25 +++++
mpn/x86/fat/mod_1_4.c | 25 +++++
mpn/x86/x86-defs.m4 | 10 ++-
mpn/x86_64/fat/diveby3.c | 21 ----
mpn/x86_64/fat/fat.c | 10 ++-
mpn/x86_64/fat/gcd_1.c | 21 ----
mpn/x86_64/fat/mode1o.c | 21 ----
mpn/x86_64/x86_64-defs.m4 | 10 ++-
17 files changed, 265 insertions(+), 272 deletions(-)
diffs (truncated from 865 to 300 lines):
diff -r be5497cb0b23 -r 9c3b5c259370 ChangeLog
--- a/ChangeLog Mon Dec 05 14:29:22 2011 +0100
+++ b/ChangeLog Mon Dec 05 21:53:11 2011 +0100
@@ -1,5 +1,25 @@
2011-12-05 Torbjorn Granlund <tege at gmplib.org>
+ * mpn/x86/fat/lshiftc.c: New file.
+ * mpn/x86/fat/mod_1_1.c: New file.
+ * mpn/x86/fat/mod_1_2.c: New file.
+ * mpn/x86/fat/mod_1_4.c: New file.
+
+ * mpn/x86/fat/diveby3.c: Remove no longer fat function.
+ * mpn/x86_64/fat/diveby3.c: Likewise.
+
+ * mpn/x86_64/fat/gcd_1.c: Remove since always provided as asm.
+ * mpn/x86_64/fat/mode1o.c: Likewise.
+
+ * configure.in (fat_functions): Update to more relevant function set.
+ Add special handling for mod_1_N_cps functions.
+ * gmp-impl.h (struct cpuvec_t) : Corresponding changes. Also add
+ vrious declarations for new functions.
+ * mpn/x86/x86-defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+ * mpn/x86_64/x86_64-defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+ * mpn/x86/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+ * mpn/x86_64/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+
* mpn/x86_64: Port most remaining x86_64 files to DOS64.
* mpn/x86_64/coreisbr/aors_n.asm: Add forgotten DOS64_EXIT.
diff -r be5497cb0b23 -r 9c3b5c259370 configure.in
--- a/configure.in Mon Dec 05 14:29:22 2011 +0100
+++ b/configure.in Mon Dec 05 21:53:11 2011 +0100
@@ -1934,10 +1934,11 @@
fat_path="x86_64 x86_64/fat x86_64/pentium4 x86_64/core2 x86_64/coreinhm x86_64/coreisbr x86_64/atom x86_64/nano"
fi
- fat_functions="add_n addmul_1 copyd copyi
- dive_1 diveby3 divrem_1 gcd_1 lshift
- mod_1 mod_34lsub1 mode1o mul_1 mul_basecase
- pre_divrem_1 pre_mod_1 rshift
+ fat_functions="add_n addmul_1 bdiv_dbm1c com copyd copyi
+ dive_1 divrem_1 gcd_1 lshift lshiftc mod_1
+ mod_1_1 mod_1_1_cps mod_1_2 mod_1_2_cps mod_1_4 mod_1_4_cps
+ mod_34lsub1 mode1o
+ mul_1 mul_basecase pre_divrem_1 pre_mod_1 rshift
sqr_basecase sub_n submul_1"
fat_thresholds="MUL_TOOM22_THRESHOLD MUL_TOOM33_THRESHOLD
SQR_TOOM2_THRESHOLD SQR_TOOM3_THRESHOLD
@@ -2732,6 +2733,14 @@
pre_divrem_1) $1=preinv_divrem_1 ;;
mode1o) $1=modexact_1c_odd ;;
pre_mod_1) $1=preinv_mod_1 ;;
+ mod_1_1) $1=mod_1_1p ;;
+ mod_1_1_cps) $1=mod_1_1p_cps ;;
+ mod_1_2) $1=mod_1s_2p ;;
+ mod_1_2_cps) $1=mod_1s_2p_cps ;;
+ mod_1_3) $1=mod_1s_3p ;;
+ mod_1_3_cps) $1=mod_1s_3p_cps ;;
+ mod_1_4) $1=mod_1s_4p ;;
+ mod_1_4_cps) $1=mod_1s_4p_cps ;;
*) $1=$$2 ;;
esac
])
@@ -2933,6 +2942,7 @@
define(__gmpn_$tmp_fbase, __gmpn_${tmp_fbase}_$tmp_suffix)
define(__gmpn_$tmp_fbasec,__gmpn_${tmp_fbasec}_${tmp_suffix})
define(__gmpn_preinv_${tmp_fbase},__gmpn_preinv_${tmp_fbase}_${tmp_suffix})
+define(__gmpn_${tmp_fbase}_cps,__gmpn_${tmp_fbase}_cps_${tmp_suffix})
$tmp_d_n_l For k6 and k7 gcd_1 calling their corresponding mpn_modexact_1_odd
ifdef(\`__gmpn_modexact_1_odd',,
@@ -2950,6 +2960,7 @@
#define __gmpn_$tmp_fbase __gmpn_${tmp_fbase}_$tmp_suffix
#define __gmpn_$tmp_fbasec __gmpn_${tmp_fbasec}_${tmp_suffix}
#define __gmpn_preinv_${tmp_fbase} __gmpn_preinv_${tmp_fbase}_${tmp_suffix}
+#define __gmpn_${tmp_fbase}_cps __gmpn_${tmp_fbase}_cps_${tmp_suffix}
#include \"$mpn_relative_top_srcdir/mpn/$tmp_dir/$tmp_base.c\"
"] >mpn/${tmp_prefix}_$tmp_fn.c
@@ -2966,6 +2977,13 @@
CPUVEC_SETUP="$CPUVEC_SETUP decided_cpuvec.preinv_$tmp_fbase = __gmpn_preinv_${tmp_fbase}_${tmp_suffix}; \\
"
fi
+
+ # Ditto for any mod_1...cps variant
+ if grep "^PROLOGUE(mpn_${tmp_fbase}_cps)" $tmp_file >/dev/null; then
+ echo "DECL_${tmp_fbase}_cps (__gmpn_${tmp_fbase}_cps_$tmp_suffix);" >>fat.h
+ CPUVEC_SETUP="$CPUVEC_SETUP decided_cpuvec.${tmp_fbase}_cps = __gmpn_${tmp_fbase}_cps_${tmp_suffix}; \\
+"
+ fi
fi
done
done
diff -r be5497cb0b23 -r 9c3b5c259370 doc/projects.html
--- a/doc/projects.html Mon Dec 05 14:29:22 2011 +0100
+++ b/doc/projects.html Mon Dec 05 21:53:11 2011 +0100
@@ -15,8 +15,8 @@
<font size=-1>
<pre>
-Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
-Foundation, Inc.
+Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011
+Free Software Foundation, Inc.
This file is part of the GNU MP Library.
@@ -37,7 +37,7 @@
<hr>
<!-- NB. timestamp updated automatically by emacs -->
- This file current as of 15 Nov 2009. An up-to-date version is available at
+ This file current as of 5 Dec 2011. An up-to-date version is available at
<a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
Please send comments about this page to gmp-devel<font>@</font>gmplib.org.
@@ -53,27 +53,9 @@
<ul>
<li> <strong>Faster multiplication</strong>
- <p> The current multiplication code uses Karatsuba, 3-way and 4-way Toom, and
- Fermat FFT. Several new developments are desirable:
-
<ol>
- <li> Write more toom multiply functions for unbalanced operands. We now have
- toom22, toom32, toom42, toom62, toom33, toom53, and toom44. Most
- desirable is toom43, which will require a new toom_interpolate_6pts
- function. Writing toom52 will then be straightforward. See also
- <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
- site</a>
-
- <li> Perhaps consider N-way Toom, N > 4. See Knuth's Seminumerical
- Algorithms for details on the method, as well as Bodrato's site. Code
- implementing it exists. This is asymptotically inferior to FFTs, but
- is finer grained.
-
- <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
- for unbalanced operations. We don't use any of the other new toom
- functions currently. Write new clever code for choosing the best toom
- function from an m-limb and an n-limb operand.
+ <li> Work on the algorithm selection code for unbalanced multiplication.
<li> Implement an FFT variant computing the coefficients mod m different
limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
@@ -92,16 +74,8 @@
<p> [We now have two implementations of this algorithm, one by Tommy
Färnqvist and one by Niels Möller.]
- <li> Add support for short products, either a given number of low limbs, a
- given number of high limbs, or perhaps the middle limbs of the result.
- High short product can be used by <code>mpf_mul</code>, by
- left-to-right Newton approximations, and for quotient approximation.
- Low half short product can be of use in sub-quadratic REDC and for
- right-to-left Newton approximations. On small sizes a short product
- will be faster simply through fewer cross-products, similar to the way
- squaring is faster. But work by Thom Mulders shows that for Karatsuba
- and higher order algorithms the advantage is progressively lost, so
- for large sizes shows products turn out to be no faster.
+ <li> Work on short products. Our mullo and mulmid are probably K, but we
+ lack mulhi.
</ol>
@@ -121,8 +95,8 @@
<p> Please make sure your new routines are fast for these three situations:
<ol>
- <li> Operands that fit into the cache.
<li> Small operands of less than, say, 10 limbs.
+ <li> Medium size operands, that fit into the cache.
<li> Huge operands that does not fit into the cache.
</ol>
@@ -145,18 +119,6 @@
21-bit pieces if one allows the split operands to be negative!)
-<li> <strong>Math functions for the mpf layer</strong>
-
- <p> Implement the functions of math.h for the GMP mpf layer! Check the book
- "Pi and the AGM" by Borwein and Borwein for ideas how to do this. These
- functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2,
- cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
-
- <p> Note that the <a href="http://mpfr.org">mpfr</a> functions already
- provide these functions, and that we usually recommend new programs to use
- mpfr instead of mpf.
-
-
<li> <strong>Faster sqrt</strong>
<p> The current code uses divisions, which are reasonably fast, but it'd be
@@ -180,9 +142,24 @@
<li> <strong>Nth root</strong>
- <p> Improve mpn_rootrem. The current code is not too bad, but its average
- time complexity is a function of the input, while it is possible to
- make it a function of the output.
+ <p> Improve mpn_rootrem. The current code is not too bad, but its time
+ complexity is a function of the input, while it is possible to make
+ the <i>average</i> complexity a function of the output.
+
+
+<li> <strong>Fat binaries</strong>
+
+ <p> Add more functions to the set of fat functions.
+
+ <p> The speed of multipliciaton is today highly dependent on combination
+ functions like <code>addlsh1_n</code>. A fat binary will never use any such
+ functions, since they are classified as optional. Ideally, we should use
+ them, but making the current compile-time selections of optional functions
+ become run-time selections for fat binaries.
+
+ <p> If we make fat binaries work really well, we should move away frm tehe
+ current configure scheme (at least by default) and instead include all code
+ always.
<li> <strong>Exceptions</strong>
@@ -343,131 +320,15 @@
<code>gmp_restrict</code>.
-<li> <strong>Nx1 Division</strong>
-
- <p> The limb-by-limb dependencies in the existing Nx1 division (and
- remainder) code means that chips with multiple execution units or
- pipelined multipliers are not fully utilized.
-
- <p> One possibility is to follow the current preinv method but taking two
- limbs at a time. That means a 2x2->4 and a 2x1->2 multiply for
- each two limbs processed, and because the 2x2 and 2x1 can each be done in
- parallel the latency will be not much more than 2 multiplies for two
- limbs, whereas the single limb method has a 2 multiply latency for just
- one limb. A version of <code>mpn_divrem_1</code> doing this has been
- written in C, but not yet tested on likely chips. Clearly this scheme
- would extend to 3x3->9 and 3x1->3 etc, though with diminishing
- returns.
-
- <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
- scheme. For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
- values R mod N, R^2 mod N, R^3 mod N, R^4 mod N. Then take dividend
- limbs and multiply them by those values, thereby reducing them (moving
- them down) by the corresponding factor. The products can be added to
- produce an intermediate remainder of 2 or 3 limbs to be similarly
- included in the next step. The point is that such multiplies can be done
- in parallel, meaning as little as 1 multiply worth of latency for 4
- limbs. If the modulus N is less than R/4 (or is it R/5?) the summed
- products will fit in 2 limbs, otherwise 3 will be required, but with the
- high only being small. Clearly this extends to as many factors of R as a
- chip can efficiently apply.
-
- <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
- for i up to k, the size of the dividend. This could then be applied at
- multiplier throughput speed like an inner product. If the powers took
- roughly k divide steps to calculate then there'd be an advantage any time
- the same N was used three or more times. Suggested by Victor Shoup in
- connection with chinese-remainder style decompositions, but perhaps with
- other uses.
-
- <p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0<=x<d
- satisfying a = q*d + x*b^n, where b=2^bits_per_limb. The factor b^n
- needed to get the true remainder r could be calculated by a powering
- algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
- service for an <code>mpn_mod_1</code>. <code>modexact_1</code> is
- simpler and on some chips can run noticeably faster than plain
- <code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
- Such a difference could soon overcome the time to calculate b^n. The
- requirement for an odd divisor in <code>modexact</code> can be handled by
- some shifting on-the-fly, or perhaps by an extra partial-limb step at the
- end.
-
<li> <strong>Factorial</strong>
- <p> The removal of twos in the current code could be extended to factors of 3
- or 5. Taking this to its logical conclusion would be a complete
- decomposition into powers of primes. The power for a prime p is of
- course floor(n/p)+floor(n/p^2)+... Conrad Curry found this is quite fast
- (using simultaneous powering as per Handbook of Applied Cryptography
- algorithm 14.88).
-
- <p> A difficulty with using all primes is that quite large n can be
- calculated on a system with enough memory, larger than we'd probably want
- for a table of primes, so some sort of sieving would be wanted. Perhaps
- just taking out the factors of 3 and 5 would give most of the speedup
- that a prime decomposition can offer.
+ <p> Rewrite for simplicty and speed. Work is in progress.
<li> <strong>Binomial Coefficients</strong>
- <p> An obvious improvement to the current code would be to strip factors of 2
- from each multiplier and divisor and count them separately, to be applied
- with a bit shift at the end. Factors of 3 and perhaps 5 could even be
- handled similarly.
-
- <p> Conrad Curry reports a big speedup for binomial coefficients using a
More information about the gmp-commit
mailing list