[Gmp-commit] /var/hg/gmp: Update some READMEs.
mercurial at gmplib.org
Sat Aug 19 19:39:00 CEST 2023
details: /var/hg/gmp/rev/028e572a8fa2
changeset: 18433:028e572a8fa2
user: Torbjorn Granlund <tg at gmplib.org>
date: Sat Aug 19 19:38:55 2023 +0200
description:
Update some READMEs.
diffstat:
mpn/s390_64/README | 88 ++++++++++++++++++++++++++++++++++++++-----------
mpn/s390_64/z13/README | 27 ---------------
2 files changed, 68 insertions(+), 47 deletions(-)
diffs (147 lines):
diff -r d7265d0824d3 -r 028e572a8fa2 mpn/s390_64/README
--- a/mpn/s390_64/README Sat Aug 19 19:26:25 2023 +0200
+++ b/mpn/s390_64/README Sat Aug 19 19:38:55 2023 +0200
@@ -1,4 +1,4 @@
-Copyright 2011 Free Software Foundation, Inc.
+Copyright 2011, 2023 Free Software Foundation, Inc.
This file is part of the GNU MP Library.
@@ -30,11 +30,12 @@
There are many generations of 64-bit s390 processors, z900, z990, z9, z10,
z196, z12, z13, z14, z15, and z16. The current GMP code was originally
-optimised for theb two oldest, z900 and z990. Better code for z13 and later
-can be found in aptly named subdirectories. The status comments below are for
-the original code.
+optimised for the two oldest, z900 and z990. Better code for z13 and later can
+be found in aptly named subdirectories.
+STATUS
+
mpn_copyi
This code makes use of a loop around MVC. It almost surely runs very
@@ -67,24 +68,71 @@
mpn_mul_1, mpn_addmul_1, mpn_submul_1
-The current code is very naive, but due to the non-pipelined nature of
-MLGR on z900 and z990, more sophisticated code would not gain much.
-
-On z10 one would need to cluster at least 4 MLGR together, in order to
-reduce stalling.
-
-On z196, one surely want to use unrolling and pipelining, to perhaps
-reach around 12 c/l. A major issue here and on z10 is ALCGR's 3 cycle
-stalling.
+The current code is 4-way unrolled, but has only been timed on z15, where we
+have better code anyway. It is believed to run well on older CPUs.
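
For reference, the semantics that the (4-way unrolled) asm implements look like
this in plain C.  The ref_ name and the use of the compiler's unsigned __int128
type are this sketch's own conventions, not anything from the GMP sources:

  #include <stdint.h>

  typedef uint64_t mp_limb_t;
  typedef long mp_size_t;

  /* {rp,n} += {up,n} * v; return the carry-out limb.  mul_1 is the same
     except that it stores the products instead of accumulating them.  */
  static mp_limb_t
  ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (mp_size_t i = 0; i < n; i++)
      {
        unsigned __int128 t = (unsigned __int128) up[i] * v + rp[i] + cy;
        rp[i] = (mp_limb_t) t;        /* low limb of product + old rp[i] + carry */
        cy = (mp_limb_t) (t >> 64);   /* high limb, becomes the next carry       */
      }
    return cy;
  }

The 4-way unrolling amounts to retiring four such iterations per loop turn.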
mpn_mul_2, mpn_addmul_2
-At least for older machines (z900, z990) with very slow MLGR, we
-should use Karatsuba's algorithm on 2-limb units, making mul_2 and
-addmul_2 the main multiplication primitives. The newer machines might
-benefit less from this approach, perhaps in particular z10, where MLGR
-clustering is more important.
+These are really, really hard to write as the needed MLG clustering and the
+register shortage leave very little wiggle room. Our attempts run slower than
+combinations of mul_1 and addmul_1.
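
For comparison, that combination is easy to write down in reference C.  The
sketch below builds on the ref_addmul_1 helper above; its calling convention
(it reads {rp,n}, additionally writes rp[n], and returns the top limb) is
chosen for the sketch rather than taken from gmp-impl.h:

  /* {rp,n} += {up,n} * {vp[0],vp[1]}; the second-highest result limb is
     written to rp[n] and the highest one is returned.  */
  static mp_limb_t
  ref_addmul_2 (mp_limb_t *rp, const mp_limb_t *up,
                mp_size_t n, const mp_limb_t *vp)
  {
    rp[n] = ref_addmul_1 (rp, up, n, vp[0]);     /* rp[0..n-1] += up * v0 */
    return ref_addmul_1 (rp + 1, up, n, vp[1]);  /* rp[1..n]   += up * v1 */
  }

A dedicated addmul_2 only pays off when it beats running the two passes above
back to back, and that is exactly where the MLG clustering and the register
pressure leave so little room.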
+
+mpn_mul_basecase, mpn_sqr_basecase
+
+These use the old, un-unrolled mul_1 and addmul_1.
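
Concretely, that structure is the plain schoolbook loop, here in the same
reference C as above (ref_mul_1 being the store-instead-of-accumulate sibling
of ref_addmul_1; the names are this sketch's own):

  /* Like ref_addmul_1 above, but stores the products instead of adding
     them into rp[].  */
  static mp_limb_t
  ref_mul_1 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (mp_size_t i = 0; i < n; i++)
      {
        unsigned __int128 t = (unsigned __int128) up[i] * v + cy;
        rp[i] = (mp_limb_t) t;
        cy = (mp_limb_t) (t >> 64);
      }
    return cy;
  }

  /* {rp, un+vn} = {up,un} * {vp,vn}, un >= vn >= 1: one mul_1 pass for
     the first vp limb, then one addmul_1 pass per remaining vp limb.  */
  static void
  ref_mul_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                    const mp_limb_t *vp, mp_size_t vn)
  {
    rp[un] = ref_mul_1 (rp, up, un, vp[0]);
    for (mp_size_t j = 1; j < vn; j++)
      rp[un + j] = ref_addmul_1 (rp + j, up, un, vp[j]);
  }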
+
+
+TODO
+
+1. Somehow clean up instruction set differences. The ad hoc test of
+ e.g. addmul_1 for playing with vl, vster, vpdi, etc., is not great.
+ Then, we have code in the z14 and z15 subdirs which almost runs on
+ older CPUs, which is also not great.
+
+2. Support fat binaries. Perhaps that's where to start, solving 1 at
+ the same time.
+
+3. Fix sqr_basecase's abysmal performance, both the top-level code and the z13
+ code. Performance suffers due to branch prediction problems. The top-level
+ code also needs the unrolled mul_1 and addmul_1. (Branch prediction is hard
+ for sqr_basecase-type code as the inner loop trip count decreases with every
+ turn in the outer loop. That plays very poorly with z15's apparent counting
+ branch predictors. Induction variable detection branch predictors are
+ needed.)
+
+ So how to fix sqr_basecase's performance? Presumably by doing more work per
+ inner-loop invocation, thus diluting the branch misprediction cost.
+ How? For z13, probably just base it on addmul_2.
-With Karatsuba, one could hope for around 16 cycles per accumulated
-128 cross product, on z990.
+ But this, writing sqr_basecase in general and addmul_2 based sqr_basecase in
+ particular, is really tricky asm programming. The doubling-on-the fly
+ needed for sqr_basecase can be confusing with addmul_1 based code, and
+ becomes quite tricky when using addmul_2.
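
To make the shape of the problem concrete, here is a reference C sketch of an
addmul_1 based sqr_basecase that dodges the on-the-fly doubling by doing the
doubling as a separate pass (ref_mul_1/ref_addmul_1 as in the sketches further
up; everything else is this sketch's own).  The inner call length n-1-i shrinks
on every outer iteration, which is the decreasing trip count that the branch
predictors choke on:

  /* {rp,2n} = {up,n}^2, n >= 1: cross products, then double, then add
     the diagonal squares.  */
  static void
  ref_sqr_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n)
  {
    mp_size_t i;
    mp_limb_t cy;

    /* 1. off-diagonal products up[i]*up[j], i < j, accumulated at limb
          position i+j; the inner length is n-1-i.  */
    rp[0] = 0;
    rp[2 * n - 1] = 0;
    if (n > 1)
      {
        rp[n] = ref_mul_1 (rp + 1, up + 1, n - 1, up[0]);
        for (i = 1; i < n - 1; i++)
          rp[n + i] = ref_addmul_1 (rp + 2 * i + 1, up + i + 1, n - 1 - i, up[i]);
      }

    /* 2. double the cross products: {rp,2n} <<= 1.  */
    cy = 0;
    for (i = 0; i < 2 * n; i++)
      {
        mp_limb_t t = rp[i];
        rp[i] = (t << 1) | cy;
        cy = t >> 63;
      }

    /* 3. add the diagonal squares up[i]^2 at limb position 2i.  */
    cy = 0;
    for (i = 0; i < n; i++)
      {
        unsigned __int128 sq = (unsigned __int128) up[i] * up[i];
        unsigned __int128 lo = (unsigned __int128) rp[2 * i] + (mp_limb_t) sq + cy;
        unsigned __int128 hi = (unsigned __int128) rp[2 * i + 1]
                               + (mp_limb_t) (sq >> 64) + (mp_limb_t) (lo >> 64);
        rp[2 * i] = (mp_limb_t) lo;
        rp[2 * i + 1] = (mp_limb_t) hi;
        cy = (mp_limb_t) (hi >> 64);
      }
  }

An addmul_2 based variant would halve the number of outer iterations, and thus
of final, mispredicted inner-loop branches, but then the doubling (and the
up[i]*up[i+1] corner term that the pair multiplier cannot cover) has to be
folded in, which is the tricky part described above.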
+
+Torbjörn has faster mul_2 and addmul_2, running at close to 2 cycles/limb on
+z15. That's not a whole lot faster than mul_1 and addmul_1, and as
+sqr_basecase and mul_basecase are based on the _1 variants, we have not
+committed this faster _2 code.
+
+Here is how a new mul_basecase should be organised:
+
+ 1. If the rp pointer is 128-bit aligned, start with mul_2 to keep alignment.
+ Else start with mul_1. Now rp will be 128-bit aligned.
+
+ 2. Loop over addmul_2. Probably don't expand it into 4 variants (addmul_2 is
+ 4-way unrolled) as that practice pays off less with the fewer outer loop
+ iterations which are the result of using addmul_2. Instead, do the n mod
+ 4 handling before each run.
+
+ 3. If there is now anything to do, finish off with an addmul_1.
+
+This means that we will sometimes both do a mul_1 first and an addmul_1 last,
+even if bn mod 2 == 0. It is expected that that will be beneficial,
+considering the alignment penalty for the 128-bit operations and the fact that the
+_2 functions are not dramatically faster than the _1 functions.
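
In reference C (same hypothetical ref_ helpers as in the sketches further up,
plus a two-line ref_mul_2 on top of them), the organisation above comes out
roughly as:

  /* {rp,n+1} = low n+1 limbs of {up,n} * {vp[0],vp[1]}; return the top limb.  */
  static mp_limb_t
  ref_mul_2 (mp_limb_t *rp, const mp_limb_t *up, mp_size_t n, const mp_limb_t *vp)
  {
    rp[n] = ref_mul_1 (rp, up, n, vp[0]);
    return ref_addmul_1 (rp + 1, up, n, vp[1]);
  }

  /* {rp, un+vn} = {up,un} * {vp,vn}, un >= vn >= 1.  */
  static void
  sketch_mul_basecase (mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                       const mp_limb_t *vp, mp_size_t vn)
  {
    mp_size_t j;

    /* 1. keep the pointer fed to the addmul_2 loop 128-bit aligned:
          start with mul_2 if rp already is aligned, else peel one vp
          limb with mul_1, after which rp + 1 is 128-bit aligned
          (assuming rp is at least limb aligned).  */
    if (((uintptr_t) rp & 15) == 0 && vn >= 2)
      {
        rp[un + 1] = ref_mul_2 (rp, up, un, vp);
        j = 2;
      }
    else
      {
        rp[un] = ref_mul_1 (rp, up, un, vp[0]);
        j = 1;
      }

    /* 2. main loop: one addmul_2 per remaining pair of vp limbs.  */
    for (; j + 2 <= vn; j += 2)
      rp[un + j + 1] = ref_addmul_2 (rp + j, up, un, vp + j);

    /* 3. odd tail: finish off with a single addmul_1.  */
    if (j < vn)
      rp[un + j] = ref_addmul_1 (rp + j, up, un, vp[j]);
  }

Note that a misaligned rp with even vn takes both the mul_1 at the top and the
addmul_1 at the bottom, which is the "mul_1 first and addmul_1 last even if
bn mod 2 == 0" case discussed above.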
+
+A new sqr_basecase should use addmul_2 too, as mentioned above. Here, we might
+get significant improvements as the branch predictor performs abysmally given
+the structure of sqr_basecase; an addmul_2 based variant should cut the number
+of mispredicted branches in half.
diff -r d7265d0824d3 -r 028e572a8fa2 mpn/s390_64/z13/README
--- a/mpn/s390_64/z13/README Sat Aug 19 19:26:25 2023 +0200
+++ b/mpn/s390_64/z13/README Sat Aug 19 19:38:55 2023 +0200
@@ -35,30 +35,3 @@
very useful. Unfortunately, the multiply support is unexpectedly limited,
forcing GMP to use the legacy mlg/mlgr instructions, juggling results
between plain and vector registers.
-
-
-Torbjörn has faster mul_2 and addmul_2, running at close to 2 cycles/limb on
-z15. That's not a whole lot faster than mul_1 and addmul_1, and as
-sqr_basecase and mul_basecase are based on the _1 variants, we have not
-committed the faster _2 code.
-
-Here is how a new mul_basecase should be organised:
-
- 1. If the rp pointer is 128-bit aligned, start with mul_2 to keep alignment.
- Else start with mul_1. Now rp will be 128-bit aligned.
-
- 2. Loop over addmul_2. Probably don't expand it into 4 variants (addmul_2 is
- 4-way unrolled) as that practice pays off less with the fewer outer loop
- iterations which are the result of using addmul_2. Instead, do the n mod
- 4 handling before each run.
-
- 3. If there is now anything to do, finish off with an addmul_1.
-
-This means that we will sometimes both do a mul_1 first and an addmul_1 last,
-even if bn mod 2 == 0. It is expected that that will be beneficial,
-considering the alignment penalty for the 128-operations and the fact that the
-_2 functions are not dramatically faster then the _1 functions.
-
-A new sqr_basecase should use addmul_2 too. Here, we might get significant
-improvements as the branch predictor performs abysmally given the structure of
-sqr_basecase; an addmul_2 based variant cuts the number of branches in half.