Basecase assembly optimisation project

Thu Sep 26 21:20:29 CEST 2013

For the last few months, I have been working on writing and rewriting
basecase code for X64-64 processors.  The result is now in the
mainline GMP repo.

The basecase code I have focused on is: mul_basecase, sqr_basecase,
mullo_basecase, and Hensel remainder via redc_1.

At the start of this project, we had good code only for AMD K8-K10 and
only for mul_basecase, sqr_basecase and mullo_basecase.

Now there is optimised code for many CPUs for these functions, see the
below diagram.  (X means new code, o means old code.)

                                mul    sqr    mullo   redc_1
AMD K8                           o      o       o       X
AMD K10                          o      o       o       X
AMD Bobcat                       X      X               X
AMD Bulldozer                    X                      X
Intel Conroe/Wolfdale            X      X       X       X
Intel Nehalem/Westmere           X      X       X       X
Intel Sandybridge/Ivybridge      X      X       X       X
Intel Haswell                    X      X       X       X
Intel Atom                       -      -       -       X

There are diagrams for the performance of the new code at
<http://gmplib.org/devel/>.  These diagrams don't include old
performance, but instead should be used to look for further
optimisation opportunities by musing over non-smoothness or unexpected
relative performance.

Only Nehalem (first generation Core i7) performs precisely like I
want, since only Nehalem and Conroe (Core2) have the expanded
mul_basecase and sqr_basecase which will be the new standard.  The
code is "expanded" in the sense that it at entry chooses one of
typically 4 outer loops, where each outer loop contains a tailored
inner loop.  This structure implies an ugly code bloat, but it saves
enough execution time in these critical functions to be worth it.

If you want to give the new code a spin, you may want to grab
e.g. <ftp://ftp.gmplib.org/pub/snapshot/gmp-5.1.90-20130926.tar.lz>.

What speedup should one expect?  It varies with CPU, operand size, and
operation.  Haswell (latest Intel CPU) and Ivy bridge (penultimate
Intel CPU) have gotten greatest improvement.  Of the operations,
redc_1 has seen the greatest improvement thanks to pipelined quotient
computation.  The speedup will vary from 50% for modexp on Haswell, to
only 6% for multiplication on Conroe. I think typical speedup will be
around 10%.

For the time being, I have eliminated special code for tiny limb
counts like 1, 2, 3, except when the mainline code couldn't handle
those sizes naturally.  Even then, the code is pretty naive.  This
will surely imply significant slowdown in some cases; I will insert
such tiny-operands code on a case-by-case basis.  Tuning the generic
code has higher priority.

-- 
Torbjörn