[Gmp-commit] /var/hg/gmp: 3 new changesets

Mon Dec 5 21:53:14 CET 2011

details:   /var/hg/gmp/rev/ebb24f6a78a8
changeset: 14524:ebb24f6a78a8
user:      Torbjorn Granlund <tege at gmplib.org>
date:      Mon Dec 05 21:43:52 2011 +0100
description:
Remove lots of cmpleted projects, add a few new ones.

details:   /var/hg/gmp/rev/e4063cfbff46
changeset: 14525:e4063cfbff46
user:      Torbjorn Granlund <tege at gmplib.org>
date:      Mon Dec 05 21:44:21 2011 +0100
description:
Clean up some tasks, file is still severely outdated.

details:   /var/hg/gmp/rev/9c3b5c259370
changeset: 14526:9c3b5c259370
user:      Torbjorn Granlund <tege at gmplib.org>
date:      Mon Dec 05 21:53:11 2011 +0100
description:
Update fat function set to be more relevant.

diffstat:

 ChangeLog                 |   20 ++++
 configure.in              |   26 ++++-
 doc/projects.html         |  203 ++++++++-------------------------------------
 doc/tasks.html            |    9 +-
 gmp-impl.h                |   59 ++++++++++++-
 mpn/x86/fat/diveby3.c     |   21 ----
 mpn/x86/fat/fat.c         |   10 ++-
 mpn/x86/fat/lshiftc.c     |   21 ++++
 mpn/x86/fat/mod_1_1.c     |   25 +++++
 mpn/x86/fat/mod_1_2.c     |   25 +++++
 mpn/x86/fat/mod_1_4.c     |   25 +++++
 mpn/x86/x86-defs.m4       |   10 ++-
 mpn/x86_64/fat/diveby3.c  |   21 ----
 mpn/x86_64/fat/fat.c      |   10 ++-
 mpn/x86_64/fat/gcd_1.c    |   21 ----
 mpn/x86_64/fat/mode1o.c   |   21 ----
 mpn/x86_64/x86_64-defs.m4 |   10 ++-
 17 files changed, 265 insertions(+), 272 deletions(-)

diffs (truncated from 865 to 300 lines):

diff -r be5497cb0b23 -r 9c3b5c259370 ChangeLog

--- a/ChangeLog	Mon Dec 05 14:29:22 2011 +0100
+++ b/ChangeLog	Mon Dec 05 21:53:11 2011 +0100
@@ -1,5 +1,25 @@
 2011-12-05  Torbjorn Granlund  <tege at gmplib.org>
 
+	* mpn/x86/fat/lshiftc.c: New file.
+	* mpn/x86/fat/mod_1_1.c: New file.
+	* mpn/x86/fat/mod_1_2.c: New file.
+	* mpn/x86/fat/mod_1_4.c: New file.
+
+	* mpn/x86/fat/diveby3.c: Remove no longer fat function.
+	* mpn/x86_64/fat/diveby3.c: Likewise.
+
+	* mpn/x86_64/fat/gcd_1.c: Remove since always provided as asm.
+	* mpn/x86_64/fat/mode1o.c: Likewise.
+
+	* configure.in (fat_functions): Update to more relevant function set.
+	Add special handling for mod_1_N_cps functions.
+	* gmp-impl.h (struct cpuvec_t) : Corresponding changes.  Also add
+	vrious declarations for new functions.
+	* mpn/x86/x86-defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+	* mpn/x86_64/x86_64-defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+	* mpn/x86/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+	* mpn/x86_64/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+
 	* mpn/x86_64: Port most remaining x86_64 files to DOS64.
 
 	* mpn/x86_64/coreisbr/aors_n.asm: Add forgotten DOS64_EXIT.
diff -r be5497cb0b23 -r 9c3b5c259370 configure.in
--- a/configure.in	Mon Dec 05 14:29:22 2011 +0100
+++ b/configure.in	Mon Dec 05 21:53:11 2011 +0100
@@ -1934,10 +1934,11 @@
 	fat_path="x86_64 x86_64/fat x86_64/pentium4 x86_64/core2 x86_64/coreinhm x86_64/coreisbr x86_64/atom x86_64/nano"
       fi
 
-      fat_functions="add_n addmul_1 copyd copyi
-		     dive_1 diveby3 divrem_1 gcd_1 lshift
-		     mod_1 mod_34lsub1 mode1o mul_1 mul_basecase
-		     pre_divrem_1 pre_mod_1 rshift
+      fat_functions="add_n addmul_1 bdiv_dbm1c com copyd copyi
+		     dive_1 divrem_1 gcd_1 lshift lshiftc mod_1
+		     mod_1_1 mod_1_1_cps mod_1_2 mod_1_2_cps mod_1_4 mod_1_4_cps
+		     mod_34lsub1 mode1o
+		     mul_1 mul_basecase pre_divrem_1 pre_mod_1 rshift
 		     sqr_basecase sub_n submul_1"
       fat_thresholds="MUL_TOOM22_THRESHOLD MUL_TOOM33_THRESHOLD
 		      SQR_TOOM2_THRESHOLD SQR_TOOM3_THRESHOLD
@@ -2732,6 +2733,14 @@
   pre_divrem_1) $1=preinv_divrem_1 ;;
   mode1o)	$1=modexact_1c_odd ;;
   pre_mod_1)	$1=preinv_mod_1 ;;
+  mod_1_1)	$1=mod_1_1p ;;
+  mod_1_1_cps)	$1=mod_1_1p_cps ;;
+  mod_1_2)	$1=mod_1s_2p ;;
+  mod_1_2_cps)	$1=mod_1s_2p_cps ;;
+  mod_1_3)	$1=mod_1s_3p ;;
+  mod_1_3_cps)	$1=mod_1s_3p_cps ;;
+  mod_1_4)	$1=mod_1s_4p ;;
+  mod_1_4_cps)	$1=mod_1s_4p_cps ;;
   *)		$1=$$2 ;;
 esac
 ])
@@ -2933,6 +2942,7 @@
 define(__gmpn_$tmp_fbase, __gmpn_${tmp_fbase}_$tmp_suffix)
 define(__gmpn_$tmp_fbasec,__gmpn_${tmp_fbasec}_${tmp_suffix})
 define(__gmpn_preinv_${tmp_fbase},__gmpn_preinv_${tmp_fbase}_${tmp_suffix})
+define(__gmpn_${tmp_fbase}_cps,__gmpn_${tmp_fbase}_cps_${tmp_suffix})
 
 $tmp_d_n_l  For k6 and k7 gcd_1 calling their corresponding mpn_modexact_1_odd
 ifdef(\`__gmpn_modexact_1_odd',,
@@ -2950,6 +2960,7 @@
 #define __gmpn_$tmp_fbase           __gmpn_${tmp_fbase}_$tmp_suffix
 #define __gmpn_$tmp_fbasec          __gmpn_${tmp_fbasec}_${tmp_suffix}
 #define __gmpn_preinv_${tmp_fbase}  __gmpn_preinv_${tmp_fbase}_${tmp_suffix}
+#define __gmpn_${tmp_fbase}_cps     __gmpn_${tmp_fbase}_cps_${tmp_suffix}
 
 #include \"$mpn_relative_top_srcdir/mpn/$tmp_dir/$tmp_base.c\"
 "] >mpn/${tmp_prefix}_$tmp_fn.c
@@ -2966,6 +2977,13 @@
               CPUVEC_SETUP="$CPUVEC_SETUP    decided_cpuvec.preinv_$tmp_fbase = __gmpn_preinv_${tmp_fbase}_${tmp_suffix}; \\
 "
             fi
+
+            # Ditto for any mod_1...cps variant
+            if grep "^PROLOGUE(mpn_${tmp_fbase}_cps)" $tmp_file >/dev/null; then
+              echo "DECL_${tmp_fbase}_cps (__gmpn_${tmp_fbase}_cps_$tmp_suffix);" >>fat.h
+              CPUVEC_SETUP="$CPUVEC_SETUP    decided_cpuvec.${tmp_fbase}_cps = __gmpn_${tmp_fbase}_cps_${tmp_suffix}; \\
+"
+            fi
           fi
         done
       done
diff -r be5497cb0b23 -r 9c3b5c259370 doc/projects.html
--- a/doc/projects.html	Mon Dec 05 14:29:22 2011 +0100
+++ b/doc/projects.html	Mon Dec 05 21:53:11 2011 +0100
@@ -15,8 +15,8 @@
 
 <font size=-1>
 <pre>
-Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
-Foundation, Inc.
+Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011
+Free Software Foundation, Inc.
 
 This file is part of the GNU MP Library.
 
@@ -37,7 +37,7 @@
 
 <hr>
 <!-- NB. timestamp updated automatically by emacs -->
-  This file current as of 15 Nov 2009.  An up-to-date version is available at
+  This file current as of 5 Dec 2011.  An up-to-date version is available at
   <a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
   Please send comments about this page to gmp-devel<font>@</font>gmplib.org.
 
@@ -53,27 +53,9 @@
 <ul>
 <li> <strong>Faster multiplication</strong>
 
-  <p> The current multiplication code uses Karatsuba, 3-way and 4-way Toom, and
-      Fermat FFT.  Several new developments are desirable:
-
   <ol>
 
-    <li> Write more toom multiply functions for unbalanced operands.  We now have
-	 toom22, toom32, toom42, toom62, toom33, toom53, and toom44.  Most
-	 desirable is toom43, which will require a new toom_interpolate_6pts
-	 function.  Writing toom52 will then be straightforward.  See also
-	 <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
-	 site</a>
-
-    <li> Perhaps consider N-way Toom, N > 4.  See Knuth's Seminumerical
-	 Algorithms for details on the method, as well as Bodrato's site.  Code
-	 implementing it exists.  This is asymptotically inferior to FFTs, but
-	 is finer grained.
-
-    <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
-	 for unbalanced operations.  We don't use any of the other new toom
-	 functions currently.  Write new clever code for choosing the best toom
-	 function from an m-limb and an n-limb operand.
+    <li> Work on the algorithm selection code for unbalanced multiplication.
 
     <li> Implement an FFT variant computing the coefficients mod m different
 	 limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
@@ -92,16 +74,8 @@
 	 <p> [We now have two implementations of this algorithm, one by Tommy
 	 Färnqvist and one by Niels Möller.]
 
-    <li> Add support for short products, either a given number of low limbs, a
-	 given number of high limbs, or perhaps the middle limbs of the result.
-	 High short product can be used by <code>mpf_mul</code>, by
-	 left-to-right Newton approximations, and for quotient approximation.
-	 Low half short product can be of use in sub-quadratic REDC and for
-	 right-to-left Newton approximations.  On small sizes a short product
-	 will be faster simply through fewer cross-products, similar to the way
-	 squaring is faster.  But work by Thom Mulders shows that for Karatsuba
-	 and higher order algorithms the advantage is progressively lost, so
-	 for large sizes shows products turn out to be no faster.
+    <li> Work on short products.  Our mullo and mulmid are probably K, but we
+         lack mulhi.
 
   </ol>
 
@@ -121,8 +95,8 @@
 
   <p> Please make sure your new routines are fast for these three situations:
       <ol>
-	<li> Operands that fit into the cache.
 	<li> Small operands of less than, say, 10 limbs.
+	<li> Medium size operands, that fit into the cache.
 	<li> Huge operands that does not fit into the cache.
       </ol>
 
@@ -145,18 +119,6 @@
       21-bit pieces if one allows the split operands to be negative!)
 
 
-<li> <strong>Math functions for the mpf layer</strong>
-
-  <p> Implement the functions of math.h for the GMP mpf layer!	Check the book
-      "Pi and the AGM" by Borwein and Borwein for ideas how to do this.  These
-      functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2,
-      cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
-
-  <p> Note that the <a href="http://mpfr.org">mpfr</a> functions already
-  provide these functions, and that we usually recommend new programs to use
-  mpfr instead of mpf.
-
-
 <li> <strong>Faster sqrt</strong>
 
   <p> The current code uses divisions, which are reasonably fast, but it'd be
@@ -180,9 +142,24 @@
 
 <li> <strong>Nth root</strong>
 
-  <p> Improve mpn_rootrem.  The current code is not too bad, but its average
-      time complexity is a function of the input, while it is possible to
-      make it a function of the output.
+  <p> Improve mpn_rootrem.  The current code is not too bad, but its time
+      complexity is a function of the input, while it is possible to make
+      the <i>average</i> complexity a function of the output.
+
+
+<li> <strong>Fat binaries</strong>
+
+  <p> Add more functions to the set of fat functions.
+
+  <p> The speed of multipliciaton is today highly dependent on combination
+  functions like <code>addlsh1_n</code>.  A fat binary will never use any such
+  functions, since they are classified as optional.  Ideally, we should use
+  them, but making the current compile-time selections of optional functions
+  become run-time selections for fat binaries.
+
+  <p> If we make fat binaries work really well, we should move away frm tehe
+  current configure scheme (at least by default) and instead include all code
+  always.
 
 
 <li> <strong>Exceptions</strong>
@@ -343,131 +320,15 @@
       <code>gmp_restrict</code>.
 
 
-<li> <strong>Nx1 Division</strong>
-
-  <p> The limb-by-limb dependencies in the existing Nx1 division (and
-      remainder) code means that chips with multiple execution units or
-      pipelined multipliers are not fully utilized.
-
-  <p> One possibility is to follow the current preinv method but taking two
-      limbs at a time.  That means a 2x2->4 and a 2x1->2 multiply for
-      each two limbs processed, and because the 2x2 and 2x1 can each be done in
-      parallel the latency will be not much more than 2 multiplies for two
-      limbs, whereas the single limb method has a 2 multiply latency for just
-      one limb.  A version of <code>mpn_divrem_1</code> doing this has been
-      written in C, but not yet tested on likely chips.  Clearly this scheme
-      would extend to 3x3->9 and 3x1->3 etc, though with diminishing
-      returns.
-
-  <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
-      scheme.  For a limb R=2^<code>bits_per_mp_limb</code>, pre-calculate
-      values R mod N, R^2 mod N, R^3 mod N, R^4 mod N.  Then take dividend
-      limbs and multiply them by those values, thereby reducing them (moving
-      them down) by the corresponding factor.  The products can be added to
-      produce an intermediate remainder of 2 or 3 limbs to be similarly
-      included in the next step.  The point is that such multiplies can be done
-      in parallel, meaning as little as 1 multiply worth of latency for 4
-      limbs.  If the modulus N is less than R/4 (or is it R/5?) the summed
-      products will fit in 2 limbs, otherwise 3 will be required, but with the
-      high only being small.  Clearly this extends to as many factors of R as a
-      chip can efficiently apply.
-
-  <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
-      for i up to k, the size of the dividend.  This could then be applied at
-      multiplier throughput speed like an inner product.  If the powers took
-      roughly k divide steps to calculate then there'd be an advantage any time
-      the same N was used three or more times.  Suggested by Victor Shoup in
-      connection with chinese-remainder style decompositions, but perhaps with
-      other uses.
-
-  <p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0<=x<d
-      satisfying a = q*d + x*b^n, where b=2^bits_per_limb.  The factor b^n
-      needed to get the true remainder r could be calculated by a powering
-      algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
-      service for an <code>mpn_mod_1</code>.  <code>modexact_1</code> is
-      simpler and on some chips can run noticeably faster than plain
-      <code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
-      Such a difference could soon overcome the time to calculate b^n.  The
-      requirement for an odd divisor in <code>modexact</code> can be handled by
-      some shifting on-the-fly, or perhaps by an extra partial-limb step at the
-      end.
-
 
 <li> <strong>Factorial</strong>
 
-  <p> The removal of twos in the current code could be extended to factors of 3
-      or 5.  Taking this to its logical conclusion would be a complete
-      decomposition into powers of primes.  The power for a prime p is of
-      course floor(n/p)+floor(n/p^2)+...  Conrad Curry found this is quite fast
-      (using simultaneous powering as per Handbook of Applied Cryptography
-      algorithm 14.88).
-
-  <p> A difficulty with using all primes is that quite large n can be
-      calculated on a system with enough memory, larger than we'd probably want
-      for a table of primes, so some sort of sieving would be wanted.  Perhaps
-      just taking out the factors of 3 and 5 would give most of the speedup
-      that a prime decomposition can offer.
+  <p> Rewrite for simplicty and speed.  Work is in progress.
 
 
 <li> <strong>Binomial Coefficients</strong>
 
-  <p> An obvious improvement to the current code would be to strip factors of 2
-      from each multiplier and divisor and count them separately, to be applied
-      with a bit shift at the end.  Factors of 3 and perhaps 5 could even be
-      handled similarly.
-
-  <p> Conrad Curry reports a big speedup for binomial coefficients using a