GMP's x86-32 performance

Niels Möller nisse at lysator.liu.se
Sat Jun 17 13:06:27 UTC 2017


tg at gmplib.org (Torbjörn Granlund) writes:

> Our latest batch of x86-32 code dates from 2011 (for the original Intel
> atom) but we have not done anything for high-end AMD and Intel CPUs
> (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
> Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
> skylake, kabylake) in a very long time.

When are those later cpus run in 32-bit mode? M$ windows or mac
applications? I would have expected x86_64 mode, possibly with some use
of the x32 abi (small pointers), to be used almost exclusively by now.

> What do I have in mind?  I believe pmovzxdq, pmuludq, psrlq (or some
> shuffle insn), and paddq could be used to build an addmul_2 which runs
> at close to 1 cycle/limb using sse2,

I think I looked at pmuludq in the past, the variant doing two 32x32->64
multiplies, without having any success. IIRC, the throughput of that
instruction on then current cpus was too poor to make it useful. Other
possible reasons for failure: (i) I didn't try hard enough, (ii) too
much shuffling around of the operands is needed.
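
For reference, a minimal sketch of what a single pmuludq computes: two
independent 32x32->64 products, taken from the low halves of the two
64-bit lanes. The function name mul2_32x32 is made up for illustration,
and the scalar fallback is only there so the sketch builds without
sse2. Note how the operands have to be spread out to one per 64-bit
lane first, which is where the shuffling comes from.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] as full 64-bit products. */
static void mul2_32x32(const uint32_t a[2], const uint32_t b[2],
                       uint64_t out[2])
{
#if defined(__SSE2__)
    /* Place each 32-bit operand in the low half of a 64-bit lane;
       the high halves are ignored by pmuludq. */
    __m128i va = _mm_set_epi32(0, (int)a[1], 0, (int)a[0]);
    __m128i vb = _mm_set_epi32(0, (int)b[1], 0, (int)b[0]);
    __m128i p = _mm_mul_epu32(va, vb);      /* compiles to pmuludq */
    _mm_storeu_si128((__m128i *)out, p);
#else
    /* Scalar model of the same lane-wise operation. */
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
#endif
}
```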

BTW, speaking of addmul_2. Where current addmul_2 wins over addmul_1,
that's because we get more independent mul instructions and can more
easily saturate multiplication units. At least, that's my understanding.
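
To illustrate the point about independent multiplies, here is a rough
portable C model with 32-bit limbs. The names limb_t, ref_addmul_1 and
ref_addmul_2 are made up for this sketch, and the calling convention of
ref_addmul_2 (store the next-to-top limb in rp[n], return the top limb)
is an assumption chosen for the example, not a statement about GMP's
assembly routines. In the addmul_2 loop the two products per iteration
do not depend on each other, only on the loaded limb.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* hypothetical 32-bit limb type */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * (vp[0] + vp[1]*2^32).
   Stores limb n of the result in rp[n], returns limb n+1. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0, c1 = 0;             /* running carry c0 + c1*2^32 */
    for (size_t i = 0; i < n; i++) {
        uint64_t p0 = (uint64_t)up[i] * v0;
        uint64_t p1 = (uint64_t)up[i] * v1;   /* independent of p0 */
        uint64_t lo = (p0 & 0xffffffffu) + rp[i] + c0;
        uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + c1 + (lo >> 32);
        rp[i] = (limb_t)lo;
        c0 = (limb_t)mid;
        c1 = (limb_t)((p1 >> 32) + (mid >> 32));
    }
    rp[n] = c0;
    return c1;
}
```

One call to ref_addmul_2 should give the same limbs as two shifted
calls to ref_addmul_1, with half as many passes over rp.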

We've considered using karatsuba aka toom2 for addmul_2, but it has
always turned out that saving 1/4 of the multiply instructions is very
easily eaten up by the additional operations needed. But the other day,
it struck me that we might also try doing addmul_2 using toom32, which
would save 1/3 of the mul instructions. Toom32 is nice because we can
use the four easiest evaluation points: 0, infinity, and +/-1.
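
A small sketch of the multiplication counts, treating the operands as
polynomials with three and two coefficients. The function names are
made up, and the coefficients are assumed small enough that nothing
overflows int64_t; with real limbs the evaluated values get extra carry
and sign bits, which is exactly the additional work that tends to eat
the savings. Schoolbook needs 6 products, toom32 with the points 0,
infinity and +/-1 needs 4.

```c
#include <stdint.h>

/* c(x) = a(x) * b(x), deg a = 2, deg b = 1: 6 multiplications. */
static void mul32_schoolbook(const int64_t a[3], const int64_t b[2],
                             int64_t c[4])
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[1] + a[1] * b[0];
    c[2] = a[1] * b[1] + a[2] * b[0];
    c[3] = a[2] * b[1];
}

/* Same product by evaluation and interpolation: 4 multiplications. */
static void mul32_toom(const int64_t a[3], const int64_t b[2],
                       int64_t c[4])
{
    int64_t w0   = a[0] * b[0];                             /* x = 0   */
    int64_t winf = a[2] * b[1];                             /* x = inf */
    int64_t w1   = (a[0] + a[1] + a[2]) * (b[0] + b[1]);    /* x = +1  */
    int64_t wm1  = (a[0] - a[1] + a[2]) * (b[0] - b[1]);    /* x = -1  */

    /* w1 + wm1 = 2*(c0 + c2) and w1 - wm1 = 2*(c1 + c3),
       so both divisions by 2 are exact. */
    c[0] = w0;
    c[3] = winf;
    c[2] = (w1 + wm1) / 2 - w0;
    c[1] = (w1 - wm1) / 2 - winf;
}
```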

Or addmul_3 using toom32, which has the additional advantage that more
of the evaluation work is loop-invariant, and we could also jump to
separate innerloops depending on the carry bits from evaluation.

Perhaps this is still crazy, and useful only for machines with very slow
multiplication.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

