[Gmpcommit] /var/hg/gmp: 3 new changesets
mercurial at gmplib.org
mercurial at gmplib.org
Mon Dec 5 21:53:14 CET 2011
details: /var/hg/gmp/rev/ebb24f6a78a8
changeset: 14524:ebb24f6a78a8
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:43:52 2011 +0100
description:
Remove lots of cmpleted projects, add a few new ones.
details: /var/hg/gmp/rev/e4063cfbff46
changeset: 14525:e4063cfbff46
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:44:21 2011 +0100
description:
Clean up some tasks, file is still severely outdated.
details: /var/hg/gmp/rev/9c3b5c259370
changeset: 14526:9c3b5c259370
user: Torbjorn Granlund <tege at gmplib.org>
date: Mon Dec 05 21:53:11 2011 +0100
description:
Update fat function set to be more relevant.
diffstat:
ChangeLog  20 ++++
configure.in  26 ++++
doc/projects.html  203 ++++++++
doc/tasks.html  9 +
gmpimpl.h  59 ++++++++++++
mpn/x86/fat/diveby3.c  21 
mpn/x86/fat/fat.c  10 ++
mpn/x86/fat/lshiftc.c  21 ++++
mpn/x86/fat/mod_1_1.c  25 +++++
mpn/x86/fat/mod_1_2.c  25 +++++
mpn/x86/fat/mod_1_4.c  25 +++++
mpn/x86/x86defs.m4  10 ++
mpn/x86_64/fat/diveby3.c  21 
mpn/x86_64/fat/fat.c  10 ++
mpn/x86_64/fat/gcd_1.c  21 
mpn/x86_64/fat/mode1o.c  21 
mpn/x86_64/x86_64defs.m4  10 ++
17 files changed, 265 insertions(+), 272 deletions()
diffs (truncated from 865 to 300 lines):
diff r be5497cb0b23 r 9c3b5c259370 ChangeLog
 a/ChangeLog Mon Dec 05 14:29:22 2011 +0100
+++ b/ChangeLog Mon Dec 05 21:53:11 2011 +0100
@@ 1,5 +1,25 @@
20111205 Torbjorn Granlund <tege at gmplib.org>
+ * mpn/x86/fat/lshiftc.c: New file.
+ * mpn/x86/fat/mod_1_1.c: New file.
+ * mpn/x86/fat/mod_1_2.c: New file.
+ * mpn/x86/fat/mod_1_4.c: New file.
+
+ * mpn/x86/fat/diveby3.c: Remove no longer fat function.
+ * mpn/x86_64/fat/diveby3.c: Likewise.
+
+ * mpn/x86_64/fat/gcd_1.c: Remove since always provided as asm.
+ * mpn/x86_64/fat/mode1o.c: Likewise.
+
+ * configure.in (fat_functions): Update to more relevant function set.
+ Add special handling for mod_1_N_cps functions.
+ * gmpimpl.h (struct cpuvec_t) : Corresponding changes. Also add
+ vrious declarations for new functions.
+ * mpn/x86/x86defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+ * mpn/x86_64/x86_64defs.m4 (CPUVEC_FUNCS_LIST): Corresponding changes.
+ * mpn/x86/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+ * mpn/x86_64/fat/fat.c (__gmpn_cpuvec): Corresponding changes.
+
* mpn/x86_64: Port most remaining x86_64 files to DOS64.
* mpn/x86_64/coreisbr/aors_n.asm: Add forgotten DOS64_EXIT.
diff r be5497cb0b23 r 9c3b5c259370 configure.in
 a/configure.in Mon Dec 05 14:29:22 2011 +0100
+++ b/configure.in Mon Dec 05 21:53:11 2011 +0100
@@ 1934,10 +1934,11 @@
fat_path="x86_64 x86_64/fat x86_64/pentium4 x86_64/core2 x86_64/coreinhm x86_64/coreisbr x86_64/atom x86_64/nano"
fi
 fat_functions="add_n addmul_1 copyd copyi
 dive_1 diveby3 divrem_1 gcd_1 lshift
 mod_1 mod_34lsub1 mode1o mul_1 mul_basecase
 pre_divrem_1 pre_mod_1 rshift
+ fat_functions="add_n addmul_1 bdiv_dbm1c com copyd copyi
+ dive_1 divrem_1 gcd_1 lshift lshiftc mod_1
+ mod_1_1 mod_1_1_cps mod_1_2 mod_1_2_cps mod_1_4 mod_1_4_cps
+ mod_34lsub1 mode1o
+ mul_1 mul_basecase pre_divrem_1 pre_mod_1 rshift
sqr_basecase sub_n submul_1"
fat_thresholds="MUL_TOOM22_THRESHOLD MUL_TOOM33_THRESHOLD
SQR_TOOM2_THRESHOLD SQR_TOOM3_THRESHOLD
@@ 2732,6 +2733,14 @@
pre_divrem_1) $1=preinv_divrem_1 ;;
mode1o) $1=modexact_1c_odd ;;
pre_mod_1) $1=preinv_mod_1 ;;
+ mod_1_1) $1=mod_1_1p ;;
+ mod_1_1_cps) $1=mod_1_1p_cps ;;
+ mod_1_2) $1=mod_1s_2p ;;
+ mod_1_2_cps) $1=mod_1s_2p_cps ;;
+ mod_1_3) $1=mod_1s_3p ;;
+ mod_1_3_cps) $1=mod_1s_3p_cps ;;
+ mod_1_4) $1=mod_1s_4p ;;
+ mod_1_4_cps) $1=mod_1s_4p_cps ;;
*) $1=$$2 ;;
esac
])
@@ 2933,6 +2942,7 @@
define(__gmpn_$tmp_fbase, __gmpn_${tmp_fbase}_$tmp_suffix)
define(__gmpn_$tmp_fbasec,__gmpn_${tmp_fbasec}_${tmp_suffix})
define(__gmpn_preinv_${tmp_fbase},__gmpn_preinv_${tmp_fbase}_${tmp_suffix})
+define(__gmpn_${tmp_fbase}_cps,__gmpn_${tmp_fbase}_cps_${tmp_suffix})
$tmp_d_n_l For k6 and k7 gcd_1 calling their corresponding mpn_modexact_1_odd
ifdef(\`__gmpn_modexact_1_odd',,
@@ 2950,6 +2960,7 @@
#define __gmpn_$tmp_fbase __gmpn_${tmp_fbase}_$tmp_suffix
#define __gmpn_$tmp_fbasec __gmpn_${tmp_fbasec}_${tmp_suffix}
#define __gmpn_preinv_${tmp_fbase} __gmpn_preinv_${tmp_fbase}_${tmp_suffix}
+#define __gmpn_${tmp_fbase}_cps __gmpn_${tmp_fbase}_cps_${tmp_suffix}
#include \"$mpn_relative_top_srcdir/mpn/$tmp_dir/$tmp_base.c\"
"] >mpn/${tmp_prefix}_$tmp_fn.c
@@ 2966,6 +2977,13 @@
CPUVEC_SETUP="$CPUVEC_SETUP decided_cpuvec.preinv_$tmp_fbase = __gmpn_preinv_${tmp_fbase}_${tmp_suffix}; \\
"
fi
+
+ # Ditto for any mod_1...cps variant
+ if grep "^PROLOGUE(mpn_${tmp_fbase}_cps)" $tmp_file >/dev/null; then
+ echo "DECL_${tmp_fbase}_cps (__gmpn_${tmp_fbase}_cps_$tmp_suffix);" >>fat.h
+ CPUVEC_SETUP="$CPUVEC_SETUP decided_cpuvec.${tmp_fbase}_cps = __gmpn_${tmp_fbase}_cps_${tmp_suffix}; \\
+"
+ fi
fi
done
done
diff r be5497cb0b23 r 9c3b5c259370 doc/projects.html
 a/doc/projects.html Mon Dec 05 14:29:22 2011 +0100
+++ b/doc/projects.html Mon Dec 05 21:53:11 2011 +0100
@@ 15,8 +15,8 @@
<font size=1>
<pre>
Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009 Free Software
Foundation, Inc.
+Copyright 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011
+Free Software Foundation, Inc.
This file is part of the GNU MP Library.
@@ 37,7 +37,7 @@
<hr>
<! NB. timestamp updated automatically by emacs >
 This file current as of 15 Nov 2009. An uptodate version is available at
+ This file current as of 5 Dec 2011. An uptodate version is available at
<a href="http://gmplib.org/projects.html">http://gmplib.org/projects.html</a>.
Please send comments about this page to gmpdevel<font>@</font>gmplib.org.
@@ 53,27 +53,9 @@
<ul>
<li> <strong>Faster multiplication</strong>
 <p> The current multiplication code uses Karatsuba, 3way and 4way Toom, and
 Fermat FFT. Several new developments are desirable:

<ol>
 <li> Write more toom multiply functions for unbalanced operands. We now have
 toom22, toom32, toom42, toom62, toom33, toom53, and toom44. Most
 desirable is toom43, which will require a new toom_interpolate_6pts
 function. Writing toom52 will then be straightforward. See also
 <a href="http://bodrato.it/software/toom.html">Marco Bodrato's
 site</a>

 <li> Perhaps consider Nway Toom, N > 4. See Knuth's Seminumerical
 Algorithms for details on the method, as well as Bodrato's site. Code
 implementing it exists. This is asymptotically inferior to FFTs, but
 is finer grained.

 <li> The mpn_mul call now (from GMP 4.3) uses toom22, toom32, and toom42
 for unbalanced operations. We don't use any of the other new toom
 functions currently. Write new clever code for choosing the best toom
 function from an mlimb and an nlimb operand.
+ <li> Work on the algorithm selection code for unbalanced multiplication.
<li> Implement an FFT variant computing the coefficients mod m different
limb size primes of the form l*2^k+1. i.e., compute m separate FFTs.
@@ 92,16 +74,8 @@
<p> [We now have two implementations of this algorithm, one by Tommy
Färnqvist and one by Niels Möller.]
 <li> Add support for short products, either a given number of low limbs, a
 given number of high limbs, or perhaps the middle limbs of the result.
 High short product can be used by <code>mpf_mul</code>, by
 lefttoright Newton approximations, and for quotient approximation.
 Low half short product can be of use in subquadratic REDC and for
 righttoleft Newton approximations. On small sizes a short product
 will be faster simply through fewer crossproducts, similar to the way
 squaring is faster. But work by Thom Mulders shows that for Karatsuba
 and higher order algorithms the advantage is progressively lost, so
 for large sizes shows products turn out to be no faster.
+ <li> Work on short products. Our mullo and mulmid are probably K, but we
+ lack mulhi.
</ol>
@@ 121,8 +95,8 @@
<p> Please make sure your new routines are fast for these three situations:
<ol>
 <li> Operands that fit into the cache.
<li> Small operands of less than, say, 10 limbs.
+ <li> Medium size operands, that fit into the cache.
<li> Huge operands that does not fit into the cache.
</ol>
@@ 145,18 +119,6 @@
21bit pieces if one allows the split operands to be negative!)
<li> <strong>Math functions for the mpf layer</strong>

 <p> Implement the functions of math.h for the GMP mpf layer! Check the book
 "Pi and the AGM" by Borwein and Borwein for ideas how to do this. These
 functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2,
 cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.

 <p> Note that the <a href="http://mpfr.org">mpfr</a> functions already
 provide these functions, and that we usually recommend new programs to use
 mpfr instead of mpf.


<li> <strong>Faster sqrt</strong>
<p> The current code uses divisions, which are reasonably fast, but it'd be
@@ 180,9 +142,24 @@
<li> <strong>Nth root</strong>
 <p> Improve mpn_rootrem. The current code is not too bad, but its average
 time complexity is a function of the input, while it is possible to
 make it a function of the output.
+ <p> Improve mpn_rootrem. The current code is not too bad, but its time
+ complexity is a function of the input, while it is possible to make
+ the <i>average</i> complexity a function of the output.
+
+
+<li> <strong>Fat binaries</strong>
+
+ <p> Add more functions to the set of fat functions.
+
+ <p> The speed of multipliciaton is today highly dependent on combination
+ functions like <code>addlsh1_n</code>. A fat binary will never use any such
+ functions, since they are classified as optional. Ideally, we should use
+ them, but making the current compiletime selections of optional functions
+ become runtime selections for fat binaries.
+
+ <p> If we make fat binaries work really well, we should move away frm tehe
+ current configure scheme (at least by default) and instead include all code
+ always.
<li> <strong>Exceptions</strong>
@@ 343,131 +320,15 @@
<code>gmp_restrict</code>.
<li> <strong>Nx1 Division</strong>

 <p> The limbbylimb dependencies in the existing Nx1 division (and
 remainder) code means that chips with multiple execution units or
 pipelined multipliers are not fully utilized.

 <p> One possibility is to follow the current preinv method but taking two
 limbs at a time. That means a 2x2>4 and a 2x1>2 multiply for
 each two limbs processed, and because the 2x2 and 2x1 can each be done in
 parallel the latency will be not much more than 2 multiplies for two
 limbs, whereas the single limb method has a 2 multiply latency for just
 one limb. A version of <code>mpn_divrem_1</code> doing this has been
 written in C, but not yet tested on likely chips. Clearly this scheme
 would extend to 3x3>9 and 3x1>3 etc, though with diminishing
 returns.

 <p> For <code>mpn_mod_1</code>, Peter L. Montgomery proposes the following
 scheme. For a limb R=2^<code>bits_per_mp_limb</code>, precalculate
 values R mod N, R^2 mod N, R^3 mod N, R^4 mod N. Then take dividend
 limbs and multiply them by those values, thereby reducing them (moving
 them down) by the corresponding factor. The products can be added to
 produce an intermediate remainder of 2 or 3 limbs to be similarly
 included in the next step. The point is that such multiplies can be done
 in parallel, meaning as little as 1 multiply worth of latency for 4
 limbs. If the modulus N is less than R/4 (or is it R/5?) the summed
 products will fit in 2 limbs, otherwise 3 will be required, but with the
 high only being small. Clearly this extends to as many factors of R as a
 chip can efficiently apply.

 <p> The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N"
 for i up to k, the size of the dividend. This could then be applied at
 multiplier throughput speed like an inner product. If the powers took
 roughly k divide steps to calculate then there'd be an advantage any time
 the same N was used three or more times. Suggested by Victor Shoup in
 connection with chineseremainder style decompositions, but perhaps with
 other uses.

 <p> <code>mpn_modexact_1_odd</code> calculates an x in the range 0<=x<d
 satisfying a = q*d + x*b^n, where b=2^bits_per_limb. The factor b^n
 needed to get the true remainder r could be calculated by a powering
 algorithm, allowing <code>mpn_modexact_1_odd</code> to be pressed into
 service for an <code>mpn_mod_1</code>. <code>modexact_1</code> is
 simpler and on some chips can run noticeably faster than plain
 <code>mod_1</code>, on Athlon for instance 11 cycles/limb instead of 17.
 Such a difference could soon overcome the time to calculate b^n. The
 requirement for an odd divisor in <code>modexact</code> can be handled by
 some shifting onthefly, or perhaps by an extra partiallimb step at the
 end.

<li> <strong>Factorial</strong>
 <p> The removal of twos in the current code could be extended to factors of 3
 or 5. Taking this to its logical conclusion would be a complete
 decomposition into powers of primes. The power for a prime p is of
 course floor(n/p)+floor(n/p^2)+... Conrad Curry found this is quite fast
 (using simultaneous powering as per Handbook of Applied Cryptography
 algorithm 14.88).

 <p> A difficulty with using all primes is that quite large n can be
 calculated on a system with enough memory, larger than we'd probably want
 for a table of primes, so some sort of sieving would be wanted. Perhaps
 just taking out the factors of 3 and 5 would give most of the speedup
 that a prime decomposition can offer.
+ <p> Rewrite for simplicty and speed. Work is in progress.
<li> <strong>Binomial Coefficients</strong>
 <p> An obvious improvement to the current code would be to strip factors of 2
 from each multiplier and divisor and count them separately, to be applied
 with a bit shift at the end. Factors of 3 and perhaps 5 could even be
 handled similarly.

 <p> Conrad Curry reports a big speedup for binomial coefficients using a
More information about the gmpcommit
mailing list