[Gmp-commit] /var/hg/www: Add arm and sparc pages, link from devel index.

mercurial at gmplib.org mercurial at gmplib.org
Mon Apr 29 17:16:19 CEST 2013


details:   /var/hg/www/rev/90404982e563
changeset: 55:90404982e563
user:      Torbjorn Granlund <tege at gmplib.org>
date:      Mon Apr 29 17:16:13 2013 +0200
description:
Add arm and sparc pages, link from devel index.

diffstat:

 devel/arm.html   |  195 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 devel/index.html |    8 +-
 devel/sparc.html |  170 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 372 insertions(+), 1 deletions(-)

diffs (truncated from 398 to 300 lines):

diff -r 92940f1f027b -r 90404982e563 devel/arm.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/devel/arm.html	Mon Apr 29 17:16:13 2013 +0200
@@ -0,0 +1,195 @@
+<!DOCTYPE HTML>
+<html>
+<head>
+  <title>GMP developers' ARM corner</title>
+  <link rel="shortcut icon" href="favicon.ico">
+  <link rel="stylesheet" href="new.css">
+  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+  <style type="text/css"> td {padding-left:4pt; padding-right:4pt;}</style>
+  <style type="text/css"> th {padding-left:4pt; padding-right:2pt;}</style>
+</head>
+
+<body>
+
+<div id="top">
+<table width="100%" bgcolor="#e8e8e8">
+  <tr>
+    <td align="left">
+      <svg width="180px" height="60px" version="1.1"
+	   viewBox="0 0 1500 500"
+	   xmlns="http://www.w3.org/2000/svg">
+	<rect x="0" y="0" width="1500" height="540" fill="#e8e8e8" />
+	<text x="0" y="440" fill="#e00000"  font-size="540" font-family="arial" font-weight="bold">
+	  GMP
+	</text>
+	<text x="50" y="500" font-size="70" font-family="Verdana">
+	  «Arithmetic without limitations»
+	</text>
+      </svg>
+    </td>
+    <td align="center">
+      <font size="+2">GMP developers' ARM corner</font>
+    </td>
+  </tr>
+</table>
+</div>
+
+<div id="container">
+  <div id="top-spacer"></div>
+
+<br><br>
+
+
+<hr>
+
+<h3> ARM core pipeline overview </h3>
+
+<blockquote>
+<table rules="groups" frame="void" cellpadding=4px>
+  <colgroup><col>
+  <thead>
+    <tr> <td>                             <th>    A7    <th>    A8    <th>    A9    <th>    A15
+  <tbody>
+    <tr> <td> issue width                 <td>    1-2   <td>    2     <td>    2     <td>     3   </tr>
+    <tr> <td> issue order                 <td> in order <td> in order <td> limited  <br> out-of-order <td> out-of-order </tr>
+    <tr> <td> Neon bits (most insn)       <td>   64     <td>   64     <td>   64     <td>  128    </tr>
+    <tr> <td> Neon bits/cycle (shifts imm count)<td>   64  <td>   64  <td>   64  <td>  128 </tr>
+    <tr> <td> Neon bits/cycle (shifts reg count)<td>   64  <td>   64  <td>   64  <td>   64  </tr>
+</table>
+</blockquote>
+
+<h3> ARM optimisation motivation </h3>
+
+<p>
+Recent ARM 32-bit CPUs have great GMP performance potential, far better than
+any other 32-bit processors.  Both the A9 and A15 can sustain 1/2 32 × 32
+→ 64 multiply per cycle using the core instruction set.
+Furthermore, A9 and A15 can sustain one and two such multiply operations
+per cycle, respectively, using the Neon extensions.  The core and Neon multiply
+units are independent, meaning that A9 can sustain 1.5 mulops/cycle and A15 can
+sustain 2.5 mulops/cycle.
+</p>
+<p>
+The current GMP code utilises the mulop bandwidth very poorly.  The goal of
+this project is to utilise the hardware better, both the multiply hardware and
+the hardware for other critical operations.
+</p>
+
+<h3> ARM Cortex-A15 projects </h3>
+<p>
+The recent A15 progress wrt mpn_mul_1 and mpn_addmul_1 (see mailing list) has
+obsoleted many asm functions: mpn_rshift, mpn_addlsh1, mpn_addlsh2,
+mpn_cnd_add_n, and could obsolete also mpn_lshift, and perhaps also various
+sub/rsb functions.
+</p>
+
+<p>
+Somewhat surprisingly, the Neon unit has better multiply throughput than shift
+throughput, perhaps making multiply-based mpn_lshift and mpn_rshift the optimal
+approach.  An alternative is to use 64-bit shifting insns (allowing accurate
+destination sub-register) and 128-bit everything else.
+</p>
+
+<p>
+Even with a properly designed architecture like ARM/Neon, high-performance GMP
+code using Neon tends to be complicated, requiring very deep software
+pipelining.  To avoid poor small operand performance, we need to use as shallow
+software pipelining as possible, and carefully design feed-in and wind-down
+code.  If small operand performance is nevertheless worse than plain code, we
+need to provide special, well-optimised basecase code.  Such basecase code is
+not an alternative to low-overhead Neon code, but might simplify the Neon code
+since it will not need to handle tiny operands.
+</p>
+
+<p>
+TODO:
+<ul>
+  <li> Finish Neon 1.35 c/l mpn_mul_1.
+    See <a href="http://gmplib.org/list-archives/gmp-devel/2013-April/003299.html">the
+      gmp-devel archives</a>.  The sw pipeline first needs to be made shallower.
+  <li> Finish Neon 1.65 c/l mpn_addmul_1.
+    See <a href="http://gmplib.org/list-archives/gmp-devel/2013-April/003299.html">the
+      gmp-devel archives</a>.  The sw pipeline first needs to be made shallower.
+  <li> Write a Neon mpn_submul_1, starting with the 1.65 c/l addmul_1,
+    complementing U on-the-fly.  Goal performance ≤ 1.82 c/l.
+  </li>
+  <li> Rewrite mpn_lshift, mpn_rshift, mpn_lshiftc (currently 1.5 c/l).  Using
+    Neon shift (128-bit insns split by the hardware, or 64-bit insns) sets a
+    lower bound of 1 c/l.  Using vmull.32 or vmlal.32 sets a lower bound of 0.5
+    c/l.
+  </li>
+  <li> Rewrite arm/v7a/cora15/neon/aorsorrlshC_n.asm (currently 2.5 c/l).  For
+    mpn_addlshC_n we may perhaps just fall back to mpn_addmul_1 (splitting the
+    rp operand), but the subtracting variants are a lot trickier and will need
+    a different scheme.  (Perhaps fall back to a future fast mpn_submul_1).
+  </li>
+  <li> Write a Neon mpn_addmul_2, similar to the new mpn_addmul_1.  Performance goal: 1.3 c/l.
+  </li>
+  <li> Write mpn_addmul_[k] for k ≥ 3 running at ≤ 1 c/l.  Note that the
+    vaddw.u32 scheme of mpn_addmul_1 and mpn_addmul_2 will not work, as the
+    64-bit accumulator would overflow.
+  </li>
+  <li> Write mpn_mul_basecase using the fastest mpn_addmul_[k] using overlapped
+    software pipelining.
+  <li> Write mpn_sqr_basecase using mpn_addmul_2 using overlapped software
+    pipelining.
+  <li> Write mpn_redc_1 perhaps for just a few sizes, handling just n ≤
+    REDC_1_TO_REDC_2_THRESHOLD.
+  <li> Write mpn_redc_2.
+  </li>
+  <li> Write mpn_mod_1s_2p, mpn_mod_1s_3p, mpn_mod_1s_4p using Neon at least
+    for the multiplies not on the critical path.  This should get us to around
+    1 c/l.
+  </li>
+  <li> Write mpn_mod_34lsub1 using Neon, for 0.6 c/l on A15 (and 1.0 c/l on
+    A9).  Perhaps also write a core mpn_mod_34lsub1, using ldrd (reaching 0.9
+    c/l on A15), or a hybrid core/Neon variant...
+  </li>
+</ul>
+</p>
+
+
+<p>
+DONE:
+<ul>
+  <li> Write a core insn mpn_submul_1 based on the 2.0 c/l mpn_addmul_1.
+  </li>
+</ul>
+</p>
+
+<br><br>
+
+<div id="footer-spacer"></div>
+
+</div>
+
+<div id="footer">
+<font size="-4">Last modified: 2013-04-29 </font>
+<table cellpadding=0 width="100%" bgcolor="#e8e8e8">
+  <tr>
+    <td align="center">
+      <font size="-3">
+	Please send comments about this page to gmp-discuss<font> at </font>gmplib.org
+      </font>
+    </td>
+  </tr>
+  <tr>
+    <td align="center">
+      <font size="-3">
+	Copyright 2013 Free Software Foundation
+      </font>
+    </td>
+  </tr>
+  <tr>
+    <td align="center">
+      <font size="-3">
+	Verbatim copying and distribution of this entire article is permitted
+	in any medium, provided this notice is preserved.
+      </font>
+    </td>
+  </tr>
+</table>
+</div>
+
+</body>
+</html>
diff -r 92940f1f027b -r 90404982e563 devel/index.html
--- a/devel/index.html	Tue Apr 09 22:58:01 2013 +0200
+++ b/devel/index.html	Mon Apr 29 17:16:13 2013 +0200
@@ -74,6 +74,12 @@
 <p> <a href="GMPng.html">List of planned GMP improvements</a>
 </p>
 
+<p> <a href="arm.html">List of desirable ARM improvements</a>
+</p>
+
+<p> <a href="sparc.html">List of desirable SPARC (T4-T5) improvements</a>
+</p>
+
 <hr>
 
 
@@ -390,7 +396,7 @@
 </div>
 
 <div id="footer">
-<font size="-4">Last modified: 2013-02-12 </font>
+<font size="-4">Last modified: 2013-04-28 </font>
 <table cellpadding=0 width="100%" bgcolor="#e8e8e8">
   <tr>
     <td align="center">
diff -r 92940f1f027b -r 90404982e563 devel/sparc.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/devel/sparc.html	Mon Apr 29 17:16:13 2013 +0200
@@ -0,0 +1,170 @@
+<!DOCTYPE HTML>
+<html>
+<head>
+  <title>GMP developers' SPARC corner</title>
+  <link rel="shortcut icon" href="favicon.ico">
+  <link rel="stylesheet" href="new.css">
+  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
+  <style type="text/css"> td {padding-left:4pt; padding-right:4pt;}</style>
+  <style type="text/css"> th {padding-left:4pt; padding-right:2pt;}</style>
+</head>
+
+<body>
+
+<div id="top">
+<table width="100%" bgcolor="#e8e8e8">
+  <tr>
+    <td align="left">
+      <svg width="180px" height="60px" version="1.1"
+	   viewBox="0 0 1500 500"
+	   xmlns="http://www.w3.org/2000/svg">
+	<rect x="0" y="0" width="1500" height="540" fill="#e8e8e8" />
+	<text x="0" y="440" fill="#e00000"  font-size="540" font-family="arial" font-weight="bold">
+	  GMP
+	</text>
+	<text x="50" y="500" font-size="70" font-family="Verdana">
+	  «Arithmetic without limitations»
+	</text>
+      </svg>
+    </td>
+    <td align="center">
+      <font size="+2">GMP developers' SPARC corner</font>
+    </td>
+  </tr>
+</table>
+</div>
+
+<div id="container">
+  <div id="top-spacer"></div>
+
+<br><br>
+
+
+<hr>
+
+<h3> SPARC core pipeline overview </h3>
+
+<blockquote>
+<table rules="groups" frame="void" cellpadding=4px>
+  <colgroup><col>
+  <thead>
+    <tr> <td>             <th>    US1 US2    <th>    US3 US4     <th>   T1-T2  <th>   T3     <th>    T4-T5
+  <tbody>
+    <tr> <td> issue width <td> 4 (2I,1LS,2FP)<td> 4 (2I,1LS,2FP) <td>     2    <td>    2     <td>   2   </tr>
+    <tr> <td> issue order <td> in order      <td> in order       <td> in order <td> in order <td> out-of-order </tr>
+</table>
+<p>FP=floating point, LS=load/store, I=intop</p>
+</blockquote>
+
+<h3> SPARC optimisation motivation </h3>
+
+<p>
+SPARC chips before T4 under-perform on GMP.  This is not because the GMP code is
+inadequately optimised for SPARC, but due to the basic v9 ISA as well as the
+micro-architecture of these chips.  The T1 and T2 chips perform worse than any
+other SPARC chips; they are comparable to a 486 chip that is 15 years older.
+</p>
+<p>
+The T4/T5 are completely different, and are not at all bad GMP performers; they
+are merely 2-3 times slower than a contemporary PC (using GMP repo code for
+SPARC).  They are just 2-issue and can perform just one 64-bit ld/st per cycle,
+but they are out-of-order and have a fully pipelined integer multiply unit,
+albeit with an extreme latency of 12 cycles.  Unlike older SPARCs, they (and

