GMP's x86-32 performance

Torbjörn Granlund tg at gmplib.org
Sat Jun 17 12:27:24 UTC 2017


The new measurement reporting pages have highlighted many improvement
opportunities, and as you might have seen I've lately fixed a handful of
the _basecase functions for x86-64.

An aspect not directly covered by the new measurement reporting is that
the 32-bit and 64-bit performance-per-limb is comparable for each CPU,
so I took a look at the timing tables manually.

With current GMP code, it turns out that 32-bit performance-per-limb is
typically worse or much worse for any semi modern CPU.  For non-multiply
functions, the performance disadvantage is not too bad, but >= 50%
speedup is possible in most cases.

For multiply, the speedup headroom is much greater, and since these
already suffer quadratically in limb size compared to the 64-bit code,
performance improvements here will be quite significant.

Our latest batch of x86-32 code dates from 2011 (for the original Intel
atom) but we have not done anything for high-end AMD and Intel CPUs
(e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
skylake, kabylake) in a very long time.

I took a quick look into what could be done with a limited effort.  We
currently use 64-bit sse2 code for some CPUs since that was optimal at
the time.  However, all above listed CPUs support full execution rate
for 128-bit sse2 code, but we don't make use of that anyplace.
Furthermore, Intel haswell, broadwell, skylake, kabylake have 256-bit
execution of avx2.

What do I have in mind?  I believe pmovzxdq, pmuludq, psrlq (or some
shuffle insn), and paddq could be used to build an addmul_2 which runs
at at close to 1 cycle/limb using sse2, and with avx2 the corresponding
instructions could build an addmul_4 running at not far from 0.5
cycle/limb.

Any volunteer willing to give it a try?

One might wonder whether sse2/avx2 couldn't also improve the x86-64
code.  Unfortunately, they cannot.  The absence of 64-bit SIMD multiply
makes plain scalar ops far better.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


More information about the gmp-devel mailing list