[Gmp-commit] /var/hg/gmp: Update some READMEs.
mercurial at gmplib.org
Sat Aug 19 19:39:00 CEST 2023
details: /var/hg/gmp/rev/028e572a8fa2
changeset: 18433:028e572a8fa2
user: Torbjorn Granlund <tg at gmplib.org>
date: Sat Aug 19 19:38:55 2023 +0200
description:
Update some READMEs.
diffstat:
mpn/s390_64/README | 88 ++++++++++++++++++++++++++++++++++++++-----------
mpn/s390_64/z13/README | 27 ---------------
2 files changed, 68 insertions(+), 47 deletions(-)
diffs (147 lines):
diff -r d7265d0824d3 -r 028e572a8fa2 mpn/s390_64/README
--- a/mpn/s390_64/README Sat Aug 19 19:26:25 2023 +0200
+++ b/mpn/s390_64/README Sat Aug 19 19:38:55 2023 +0200
@@ -1,4 +1,4 @@
-Copyright 2011 Free Software Foundation, Inc.
+Copyright 2011, 2023 Free Software Foundation, Inc.
This file is part of the GNU MP Library.
@@ -30,11 +30,12 @@
There are many generations of 64-bit s390 processors, z900, z990, z9, z10,
z196, z12, z13, z14, z15, and z16. The current GMP code was originally
-optimised for theb two oldest, z900 and z990. Better code for z13 and later
-can be found in aptly named subdirectories. The status comments below are for
-the original code.
+optimised for the two oldest, z900 and z990. Better code for z13 and later can
+be found in aptly named subdirectories.
+STATUS
+
mpn_copyi
This code makes use of a loop around MVC. It almost surely runs very
@@ -67,24 +68,71 @@
mpn_mul_1, mpn_addmul_1, mpn_submul_1
-The current code is very naive, but due to the non-pipelined nature of
-MLGR on z900 and z990, more sophisticated code would not gain much.
-
-On z10 one would need to cluster at least 4 MLGR together, in order to
-reduce stalling.
-
-On z196, one surely want to use unrolling and pipelining, to perhaps
-reach around 12 c/l. A major issue here and on z10 is ALCGR's 3 cycle
-stalling.
+The current code is 4-way unrolled, but has only been timed on z15, where we
+have better code anyway. It is believed to run well on older CPUs.
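
For reference, the semantics that the (4-way unrolled) asm implements look like
this in plain C.  The ref_ name and the use of the compiler's unsigned __int128
type are this sketch's own conventions, not anything from the GMP sources:

  #include <stdint.h>

  typedef uint64_t mp_limb_t;
  typedef long mp_size_t;

  /* {rp,n} += {up,n} * v; return the carry-out limb.  mul_1 is the same
     except that it stores the products instead of accumulating them.  */
  static mp_limb_t
  ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (mp_size_t i = 0; i < n; i++)
      {
        unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + cy;
        rp[i] = (mp_limb_t) t;        /* low limb of product + old rp[i] + carry */
        cy = (mp_limb_t) (t >> 64);   /* high limb, becomes the next carry       */
      }
    return cy;
  }

The 4-way unrolling amounts to retiring four such iterations per loop turn.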
mpn_mul_2, mpn_addmul_2
-At least for older machines (z900, z990) with very slow MLGR, we
-should use Karatsuba's algorithm on 2-limb units, making mul_2 and
-addmul_2 the main multiplication primitives. The newer machines might
-benefit less from this approach, perhaps in particular z10, where MLGR
-clustering is more important.
+These are really, really hard to write as the needed MLG clustering and the
+register shortage leave very little wiggle room. Our attempts run slower than
+combinations of mul_1 and addmul_1.
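
For comparison, that combination is easy to write down in reference C.  The
sketch below builds on the ref_addmul_1 helper above; its calling convention
(it reads {rp,n}, additionally writes rp[n], and returns the top limb) is
chosen for the sketch rather than taken from gmp-impl.h:

  /* {rp,n} += {up,n} * {vp[0],vp[1]}; the second-highest result limb is
     written to rp[n] and the highest one is returned.  */
  static mp_limb_t
  ref_addmul_2 (mp_limb_t *rp, const mp_limb_t *up,
                mp_size_t n, const mp_limb_t *vp)
  {
    rp[n] = ref_addmul_1 (rp, up, n, vp[0]);     /* rp[0..n-1] += up * v0 */
    return ref_addmul_1 (rp + 1, up, n, vp[1]);  /* rp[1..n]   += up * v1 */
  }

A dedicated addmul_2 only pays off when it beats running the two passes above
back to back, and that is exactly where the MLG clustering and the register
pressure leave so little room.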
+
+mpn_mul_basecase, mpn_sqr_basecase
+
+These use the old, un-unrolled mul_1 and addmul_1.
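
Concretely, that structure is the plain schoolbook loop, here in the same
reference C as above (ref_mul_1 being the store-instead-of-accumulate sibling
of ref_addmul_1; the names are this sketch's own):

  /* Like ref_addmul_1 above, but stores the products instead of adding
     them into rp[].  */
  static mp_limb_t
  ref_mul_1 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (mp_size_t i = 0; i < n; i++)
      {
        unsigned __int128 t = (unsigned __int128) up[i] * v + cy;
        rp[i] = (mp_limb_t) t;
        cy = (mp_limb_t) (t >> 64);
      }
    return cy;
  }

  /* {rp, un+vn} = {up,un} * {vp,vn}, un >= vn >= 1: one mul_1 pass for
     the first vp limb, then one addmul_1 pass per remaining vp limb.  */
  static void
  ref_mul_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                    const mp_limb_t *vp, mp_size_t vn)
  {
    rp[un] = ref_mul_1 (rp, up, un, vp[0]);
    for (mp_size_t j = 1; j < vn; j++)
      rp[un + j] = ref_addmul_1 (rp + j, up, un, vp[j]);
  }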
+
+
+TODO
+
+1. Somehow clean up instruction set differences. The ad hoc test of
+ e.g. addmul_1 for playing with vl, vster, vpdi, etc., is not great.
+ Then, we have code in the z14 and z15 subdirs which almost runs on
+ older CPUs, which is also not great.
+
+2. Support fat binaries. Perhaps that's where to start, solving 1 at
+ the same time.
+
+3. Fix sqr_basecase's abysmal performance, both the top-level code and the z13
+ code. Performance suffers due to branch prediction problems. The top-level
+ code also needs the unrolled mul_1 and addmul_1. (Branch prediction is hard
+ for sqr_basecase-type code as the inner loop trip count decreases with every
+ turn in the outer loop. That plays very poorly with z15's apparent counting
+ branch predictors. Induction variable detection branch predictors are
+ needed.)
+
+ So how to fix sqr_basecase's performance? Presumably by doing more work per
+ inner-loop invocation, thus diluting the branch misprediction cost.
+ How? For z13, probably just base it on addmul_2.
-With Karatsuba, one could hope for around 16 cycles per accumulated
-128 cross product, on z990.
+ But this, writing sqr_basecase in general and addmul_2 based sqr_basecase in
+ particular, is really tricky asm programming. The doubling-on-the fly
+ needed for sqr_basecase can be confusing with addmul_1 based code, and
+ becomes quite tricky when using addmul_2.
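
To make the shape of the problem concrete, here is a reference C sketch of an
addmul_1 based sqr_basecase that dodges the on-the-fly doubling by doing the
doubling as a separate pass (ref_mul_1/ref_addmul_1 as in the sketches further
up; everything else is this sketch's own).  The inner call length n-1-i shrinks
on every outer iteration, which is the decreasing trip count that the branch
predictors choke on:

  /* {rp,2n} = {up,n}^2, n >= 1: cross products, then double, then add
     the diagonal squares.  */
  static void
  ref_sqr_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n)
  {
    mp_size_t i;
    mp_limb_t cy;

    /* 1. off-diagonal products up[i]*up[j], i < j, accumulated at limb
          position i+j; the inner length is n-1-i.  */
    rp[0] = 0;
    rp[2 * n - 1] = 0;
    if (n > 1)
      {
        rp[n] = ref_mul_1 (rp + 1, up + 1, n - 1, up[0]);
        for (i = 1; i < n - 1; i++)
          rp[n + i] = ref_addmul_1 (rp + 2 * i + 1, up + i + 1, n - 1 - i, up[i]);
      }

    /* 2. double the cross products: {rp,2n} <<= 1.  */
    cy = 0;
    for (i = 0; i < 2 * n; i++)
      {
        mp_limb_t t = rp[i];
        rp[i] = (t << 1) | cy;
        cy = t >> 63;
      }

    /* 3. add the diagonal squares up[i]^2 at limb position 2i.  */
    cy = 0;
    for (i = 0; i < n; i++)
      {
        unsigned __int128 sq = (unsigned __int128) up[i] * up[i];
        unsigned __int128 lo = (unsigned __int128) rp[2 * i] + (mp_limb_t) sq + cy;
        unsigned __int128 hi = (unsigned __int128) rp[2 * i + 1]
                               + (mp_limb_t) (sq >> 64) + (mp_limb_t) (lo >> 64);
        rp[2 * i] = (mp_limb_t) lo;
        rp[2 * i + 1] = (mp_limb_t) hi;
        cy = (mp_limb_t) (hi >> 64);
      }
  }

An addmul_2 based variant would halve the number of outer iterations, and thus
of final, mispredicted inner-loop branches, but then the doubling (and the
up[i]*up[i+1] corner term that the pair multiplier cannot cover) has to be
folded in, which is the tricky part described above.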
+
+Torbjörn has faster mul_2 and addmul_2, running at close to 2 cycles/limb on
+z15. That's not a whole lot faster than mul_1 and addmul_1, and as
+sqr_basecase and mul_basecase are based on the _1 variants, we have not
+committed this faster _2 code.
+
+Here is how a new mul_basecase should be organised:
+
+ 1. If the rp pointer is 128-bit aligned, start with mul_2 to keep alignment.
+ Else start with mul_1. Now rp will be 128-bit aligned.
+
+ 2. Loop over addmul_2. Probably don't expand it into 4 variants (addmul_2 is
+ 4-way unrolled) as that practice pays off less with the fewer outer loop
+ iterations which are the result of using addmul_2. Instead, do the n mod
+ 4 handling before each run.
+
+ 3. If there is now anything to do, finish off with an addmul_1.
+
+This means that we will sometimes both do a mul_1 first and an addmul_1 last,
+even if bn mod 2 == 0. It is expected that that will be beneficial,
+considering the alignment penalty for the 128-bit operations and the fact that the
+_2 functions are not dramatically faster than the _1 functions.
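
In reference C (same hypothetical ref_ helpers as in the sketches further up,
plus a two-line ref_mul_2 on top of them), the organisation above comes out
roughly as:

  /* {rp,n+1} = low n+1 limbs of {up,n} * {vp[0],vp[1]}; return the top limb.  */
  static mp_limb_t
  ref_mul_2 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, const mp_limb_t *vp)
  {
    rp[n] = ref_mul_1 (rp, up, n, vp[0]);
    return ref_addmul_1 (rp + 1, up, n, vp[1]);
  }

  /* {rp, un+vn} = {up,un} * {vp,vn}, un >= vn >= 1.  */
  static void
  sketch_mul_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                       const mp_limb_t *vp, mp_size_t vn)
  {
    mp_size_t j;

    /* 1. keep the pointer fed to the addmul_2 loop 128-bit aligned:
          start with mul_2 if rp already is aligned, else peel one vp
          limb with mul_1, after which rp + 1 is 128-bit aligned
          (assuming rp is at least limb aligned).  */
    if (((uintptr_t) rp & 15) == 0 && vn >= 2)
      {
        rp[un + 1] = ref_mul_2 (rp, up, un, vp);
        j = 2;
      }
    else
      {
        rp[un] = ref_mul_1 (rp, up, un, vp[0]);
        j = 1;
      }

    /* 2. main loop: one addmul_2 per remaining pair of vp limbs.  */
    for (; j + 2 <= vn; j += 2)
      rp[un + j + 1] = ref_addmul_2 (rp + j, up, un, vp + j);

    /* 3. odd tail: finish off with a single addmul_1.  */
    if (j < vn)
      rp[un + j] = ref_addmul_1 (rp + j, up, un, vp[j]);
  }

Note that a misaligned rp with even vn takes both the mul_1 at the top and the
addmul_1 at the bottom, which is the "mul_1 first and addmul_1 last even if
bn mod 2 == 0" case discussed above.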
+
+A new sqr_basecase should use addmul_2 too, as mentioned above. Here, we might
+get significant improvements as the branch predictor performs abysmally given
+the structure of sqr_basecase; an addmul_2 based variant should cut the number
+of mispredicted branches in half.
diff -r d7265d0824d3 -r 028e572a8fa2 mpn/s390_64/z13/README
--- a/mpn/s390_64/z13/README Sat Aug 19 19:26:25 2023 +0200
+++ b/mpn/s390_64/z13/README Sat Aug 19 19:38:55 2023 +0200
@@ -35,30 +35,3 @@
very useful. Unfortunately, the multiply support is unexpectedly limited,
forcing GMP to use the legacy mlg/mlgr instructions, juggling results
between plain and vector registers.
-
-
-Torbjörn has faster mul_2 and addmul_2, running at close to 2 cycles/limb on
-z15. That's not a whole lot faster than mul_1 and addmul_1, and as
-sqr_basecase and mul_basecase are based on the _1 variants, we have not
-committed the faster _2 code.
-
-Here is how a new mul_basecase should be organised:
-
- 1. If the rp pointer is 128-bit aligned, start with mul_2 to keep alignment.
- Else start with mul_1. Now rp will be 128-bit aligned.
-
- 2. Loop over addmul_2. Probably don't expand it into 4 variants (addmul_2 is
- 4-way unrolled) as that practice pays off less with the fewer outer loop
- iterations which are the result of using addmul_2. Instead, do the n mod
- 4 handling before each run.
-
- 3. If there is now anything to do, finish off with an addmul_1.
-
-This means that we will sometimes both do a mul_1 first and an addmul_1 last,
-even if bn mod 2 == 0. It is expected that that will be beneficial,
-considering the alignment penalty for the 128-operations and the fact that the
-_2 functions are not dramatically faster then the _1 functions.
-
-A new sqr_basecase should use addmul_2 too. Here, we might get significant
-improvements as the branch predictor performs abysmally given the structure of
-sqr_basecase; an addmul_2 based variant cuts the number of branches in half.