GMP's x86-32 performance
Torbjörn Granlund
tg at gmplib.org
Sat Jun 17 12:27:24 UTC 2017
The new measurement reporting pages have highlighted many improvement
opportunities, and as you might have seen I've lately fixed a handful of
the _basecase functions for x86-64.
An aspect not directly covered by the new measurement reporting is that
the 32-bit and 64-bit performance-per-limb is comparable for each CPU,
so I took a look at the timing tables manually.
With current GMP code, it turns out that 32-bit performance-per-limb is
typically worse or much worse for any semi modern CPU. For non-multiply
functions, the performance disadvantage is not too bad, but >= 50%
speedup is possible in most cases.
For multiply, the speedup headroom is much greater, and since these
already suffer quadratically in limb size compared to the 64-bit code,
performance improvements here will be quite significant.
Our latest batch of x86-32 code dates from 2011 (for the original Intel
atom) but we have not done anything for high-end AMD and Intel CPUs
(e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
skylake, kabylake) in a very long time.
I took a quick look into what could be done with a limited effort. We
currently use 64-bit sse2 code for some CPUs since that was optimal at
the time. However, all above listed CPUs support full execution rate
for 128-bit sse2 code, but we don't make use of that anyplace.
Furthermore, Intel haswell, broadwell, skylake, kabylake have 256-bit
execution of avx2.
What do I have in mind? I believe pmovzxdq, pmuludq, psrlq (or some
shuffle insn), and paddq could be used to build an addmul_2 which runs
at at close to 1 cycle/limb using sse2, and with avx2 the corresponding
instructions could build an addmul_4 running at not far from 0.5
cycle/limb.
Any volunteer willing to give it a try?
One might wonder whether sse2/avx2 couldn't also improve the x86-64
code. Unfortunately, they cannot. The absence of 64-bit SIMD multiply
makes plain scalar ops far better.
--
Torbjörn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list