GMP's x86-32 performance

Sat Jun 17 13:24:02 UTC 2017

nisse at lysator.liu.se (Niels Möller) writes:

  > Our latest batch of x86-32 code dates from 2011 (for the original Intel
  > atom) but we have not done anything for high-end AMD and Intel CPUs
  > (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
  > Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
  > skylake, kabylake) in a very long time.

  When are those later cpus run in 32-bit mode? M$ windows or mac
  applications? I would have expected 64_64 mode, possibly with some use
  of the x32 abi (small pointers), to be used almost exclusively by now.

The use is probably limited, but some applications still uses the 32-bit
ABI and there only a limb32 GMP is available.  And of course 32-bit
Unix, 32-bit M$DOS, or 32-bit MacOS will enforce limb32.

That's why I wrote "limited effort"; x86-64 code is more important.  But
here we have a 4x speedup waiting to be harvested!

  I think I looked at pmuludq in the past, the variant doing two 32x32->64
  multiplies, without having any success. IIRC, the throughput of that
  instruction on then current cpus was too poor to make it useful. Other
  possible reasons for failure: (i) I didn't try hard enough, (ii) too
  much shuffling around of the operands are needed.

That used to be my experience too.

Things have changed since then, New data movement instructions (e.g.,
sse4.1 pmovzxdq, avx2 vperm2i128, as well as many other suffle insns)
and wider executions paths seem to allow for a great speedup.

  BTW, speaking of addmul_2. Where current addmul_2 wins over addmul_1,
  that's because we get more independent mul instructions and can more
  easily saturate multiplication units. At least, that's my understanding.

Also addmul_2 used fewer load/store per multiply than addmul_1.  But for
SIMD, we might find it hard to sum things together for the carry chain
unless we go to addmul_2 or addmul_4.

  We've considered using karatsuba aka toom2 for addmul_2, but it has
  always turned out that saving 1/4 of the multiply instructions is very
  easily eaten up by the additional operations needed. But the other day,
  it striked me that we might also try doing addmul_2 using toom32, which
  would save 1/3 of the mul instructions. Toom32 is nice because we can
  use the four easiest evaluation points: 0, infinity, and +/-1.

Give it a shot!  A concern with any toom stuff is side-channel leakage.
We currently assume _basecase is non-leaky.  So any low-level trickery
needs to be applied above the _basecase cutoff points (unless we split
out _basecase_sec functions).

  Or addmul_3 using toom32, which has the additional advantage that more
  of the evaluation work is loop-invariant, and we could also jump to
  separate innerloops depending on the carry bits from evaluation.

  Perhaps this is still crazy, and useful only for machines with very slow
  multiplication.

Slow hardware multiplication helps.  :-)

But we should never be content with past conclusions in the GMP world.
We might have been mistaken in the past, and the world has probably
changed since we made our conclusions.

-- 
Torbjörn
Please encrypt, key id 0xC8601622