GMP's x86-32 performance

Niels Möller nisse at
Sat Jun 17 13:06:27 UTC 2017

tg at (Torbjörn Granlund) writes:

> Our latest batch of x86-32 code dates from 2011 (for the original Intel
> atom) but we have not done anything for high-end AMD and Intel CPUs
> (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
> Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
> skylake, kabylake) in a very long time.

When are those later cpus run in 32-bit mode? M$ windows or mac
applications? I would have expected x86_64 mode, possibly with some use
of the x32 abi (small pointers), to be used almost exclusively by now.

> What do I have in mind?  I believe pmovzxdq, pmuludq, psrlq (or some
> shuffle insn), and paddq could be used to build an addmul_2 which runs
> at close to 1 cycle/limb using sse2,

I think I looked at pmuludq in the past, the variant doing two 32x32->64
multiplies, without having any success. IIRC, the throughput of that
instruction on then-current cpus was too poor to make it useful. Other
possible reasons for failure: (i) I didn't try hard enough, (ii) too
much shuffling around of the operands is needed.
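
To make the operand layout concrete, here is a minimal sketch of the
two-multiply pmuludq variant via the SSE2 intrinsic _mm_mul_epu32. The
helper name mul_2x32 is made up for illustration and is not GMP code;
the scalar fallback is only there so the sketch builds without SSE2.
Note that the inputs must sit in the even 32-bit elements, which is
where the shuffling cost comes from.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not part of GMP): compute the two full products
   a0*b0 and a1*b1, each 32x32->64 bits, into out[0] and out[1]. */
static void mul_2x32(uint32_t a0, uint32_t a1,
                     uint32_t b0, uint32_t b1, uint64_t out[2])
{
#if defined(__SSE2__)
    /* pmuludq multiplies the low 32 bits of each 64-bit lane, so the
       operands go into 32-bit elements 0 and 2; elements 1 and 3 are
       ignored by the multiply. */
    __m128i va = _mm_set_epi32(0, (int)a1, 0, (int)a0);
    __m128i vb = _mm_set_epi32(0, (int)b1, 0, (int)b0);
    __m128i p  = _mm_mul_epu32(va, vb);   /* two 64-bit products */
    _mm_storeu_si128((__m128i *)out, p);
#else
    /* Scalar fallback with the same result, for non-SSE2 builds. */
    out[0] = (uint64_t)a0 * b0;
    out[1] = (uint64_t)a1 * b1;
#endif
}
```

In a real loop the limbs arrive packed in adjacent 32-bit elements, so
each load needs a pmovzxdq or a shuffle before the multiply.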

BTW, speaking of addmul_2. Where current addmul_2 wins over addmul_1,
that's because we get more independent mul instructions and can more
easily saturate multiplication units. At least, that's my understanding.
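
A portable reference sketch of the two loops, assuming 32-bit limbs.
The names ref_addmul_1 and ref_addmul_2 and the exact calling
convention are chosen for illustration only and may differ from the
real mpn functions. The point is that each iteration of the second
loop issues two multiplies that do not depend on each other.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs */

/* {rp,n} += {up,n} * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* One multiply per limb; the sum cannot overflow 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = t >> 32;
    }
    return (limb_t)cy;
}

/* {rp,n+1} += {up,n} * {vp,2}; returns the carry-out limb. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    uint64_t c0 = 0;   /* carry to add at position i     */
    uint64_t c1 = 0;   /* carry to add at position i + 1 */
    for (size_t i = 0; i < n; i++) {
        /* Two independent multiplies per iteration. */
        uint64_t p0 = (uint64_t)up[i] * vp[0];
        uint64_t p1 = (uint64_t)up[i] * vp[1];
        uint64_t t = (uint64_t)rp[i] + (uint32_t)p0 + c0;
        rp[i] = (limb_t)t;
        uint64_t u = (t >> 32) + (p0 >> 32) + (uint32_t)p1 + c1;
        c0 = (uint32_t)u;
        c1 = (u >> 32) + (p1 >> 32);
    }
    uint64_t t = (uint64_t)rp[n] + c0;
    rp[n] = (limb_t)t;
    return (limb_t)((t >> 32) + c1);
}
```

One call to ref_addmul_2 gives the same result as two ref_addmul_1
passes offset by one limb, but reads up[] and updates rp[] only once.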

We've considered using karatsuba aka toom2 for addmul_2, but it has
always turned out that saving 1/4 of the multiply instructions is very
easily eaten up by the additional operations needed. But the other day,
it struck me that we might also try doing addmul_2 using toom32, which
would save 1/3 of the mul instructions. Toom32 is nice because we can
use the four easiest evaluation points: 0, infinity, and +/-1.
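
For reference, a small sketch of the toom32 arithmetic on a single
3-limb by 2-limb block: four multiplies instead of the six needed by
schoolbook. It assumes 16-bit limbs and signed 64-bit intermediates so
that the evaluation sums cannot overflow; a real inner loop would have
to track the carry bits and the sign of the value at -1 instead. The
name toom32_coeffs is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical illustration, not GMP code.  Computes the coefficients
   c[0..3] of (a0 + a1 x + a2 x^2) * (b0 + b1 x) with 4 multiplies,
   evaluating at the points 0, infinity, +1 and -1. */
static void toom32_coeffs(const uint16_t a[3], const uint16_t b[2],
                          uint64_t c[4])
{
    /* Evaluation: additions and subtractions only. */
    int64_t a_p1 = (int64_t)a[0] + a[1] + a[2];   /* A(+1) */
    int64_t a_m1 = (int64_t)a[0] - a[1] + a[2];   /* A(-1), may be < 0 */
    int64_t b_p1 = (int64_t)b[0] + b[1];          /* B(+1) */
    int64_t b_m1 = (int64_t)b[0] - b[1];          /* B(-1), may be < 0 */

    /* The four pointwise multiplies. */
    int64_t w0   = (int64_t)a[0] * b[0];          /* C(0)   = c0 */
    int64_t winf = (int64_t)a[2] * b[1];          /* C(inf) = c3 */
    int64_t w1   = a_p1 * b_p1;                   /* C(+1) */
    int64_t wm1  = a_m1 * b_m1;                   /* C(-1) */

    /* Interpolation: both divisions by 2 are exact. */
    int64_t even = (w1 + wm1) / 2;                /* c0 + c2 */
    int64_t odd  = (w1 - wm1) / 2;                /* c1 + c3 */

    c[0] = (uint64_t)w0;
    c[1] = (uint64_t)(odd - winf);
    c[2] = (uint64_t)(even - w0);
    c[3] = (uint64_t)winf;
}
```

The full product is c0 + c1*B + c2*B^2 + c3*B^3 with B = 2^16 here;
the coefficients overlap and still need carry propagation when they
are added into the result.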

Or addmul_3 using toom32, which has the additional advantage that more
of the evaluation work is loop-invariant, and we could also jump to
separate innerloops depending on the carry bits from evaluation.

Perhaps this is still crazy, and useful only for machines with very slow


Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
