GMP's x86-32 performance
Niels Möller
nisse at lysator.liu.se
Sat Jun 17 13:06:27 UTC 2017
tg at gmplib.org (Torbjörn Granlund) writes:
> Our latest batch of x86-32 code dates from 2011 (for the original Intel
> atom) but we have not done anything for high-end AMD and Intel CPUs
> (e.g., AMD k10, bulldozer, piledriver, steamroller, excavator, zen, or
> Intel penryn, nehalem, sandybridge, ivybridge, haswell, broadwell,
> skylake, kabylake) in a very long time.
When are those later cpus run in 32-bit mode? M$ windows or mac
applications? I would have expected x86_64 mode, possibly with some use
of the x32 abi (small pointers), to be used almost exclusively by now.
> What do I have in mind? I believe pmovzxdq, pmuludq, psrlq (or some
> shuffle insn), and paddq could be used to build an addmul_2 which runs
> at close to 1 cycle/limb using sse2,
I think I looked at pmuludq in the past, the variant doing two 32x32->64
multiplies, without having any success. IIRC, the throughput of that
instruction on then current cpus was too poor to make it useful. Other
possible reasons for failure: (i) I didn't try hard enough, (ii) too
much shuffling around of the operands is needed.
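For concreteness, here is a minimal sketch of the "two 32x32->64
multiplies" variant, written with C intrinsics rather than assembly. The
helper name mul_2x32 is hypothetical and not part of GMP. It uses
_mm_mul_epu32 (which compiles to pmuludq) and zero-extends the operands
with an SSE2 unpack rather than pmovzxdq, so that only SSE2 is assumed.
A scalar fallback is included so it also builds without SSE2. It says
nothing about throughput; it only shows the operand shuffling involved.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not GMP code): computes the two full products
   p[0] = u[0] * v and p[1] = u[1] * v. */
static void mul_2x32(uint64_t p[2], const uint32_t u[2], uint32_t v)
{
    assert(p != 0 && u != 0);
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    /* Load 64 bits: 32-bit lanes are [u0, u1, 0, 0]. */
    __m128i uu = _mm_loadl_epi64((const __m128i *)u);
    /* Interleave with zero: lanes become [u0, 0, u1, 0], i.e. each
       input limb sits in the low half of a 64-bit lane. */
    uu = _mm_unpacklo_epi32(uu, zero);
    /* Broadcast the multiplier into both 64-bit lanes. */
    __m128i vv = _mm_set1_epi64x((long long)v);
    /* pmuludq: multiplies the low 32 bits of each 64-bit lane,
       giving two 64-bit products. */
    __m128i pp = _mm_mul_epu32(uu, vv);
    _mm_storeu_si128((__m128i *)p, pp);
#else
    /* Scalar fallback when SSE2 is not enabled at compile time. */
    p[0] = (uint64_t)u[0] * v;
    p[1] = (uint64_t)u[1] * v;
#endif
}
```

Turning this into an addmul still requires adding the high half of one
product to the low half of the next, which is where the extra shifts or
shuffles come in.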
BTW, speaking of addmul_2. Where current addmul_2 wins over addmul_1,
that's because we get more independent mul instructions and can more
easily saturate multiplication units. At least, that's my understanding.
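To illustrate the point about independent multiplies, here is a rough
reference sketch in C with 32-bit limbs. The names ref_addmul_1 and
ref_addmul_2 and the carry-out convention are assumptions made for this
example; they are not GMP's actual mpn_addmul_1/mpn_addmul_2 code or
calling convention. In the addmul_2 loop, each iteration has two
multiplies that do not depend on each other, while addmul_1 has one
multiply per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up,
                             size_t n, uint32_t v)
{
    uint32_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* At most (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so no overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * (vp[0] + vp[1] * 2^32).
   The two-limb carry-out is stored in carry_out[0] (low), carry_out[1]. */
static void ref_addmul_2(uint32_t *rp, const uint32_t *up, size_t n,
                         const uint32_t vp[2], uint32_t carry_out[2])
{
    uint32_t c0 = 0, c1 = 0;   /* running two-limb carry */
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* Two independent multiplies per input limb. */
        uint64_t p0 = (uint64_t)up[i] * vp[0];
        uint64_t p1 = (uint64_t)up[i] * vp[1];
        uint64_t lo = (p0 & 0xffffffffu) + rp[i] + c0;
        uint64_t hi = (lo >> 32) + (p0 >> 32) + (p1 & 0xffffffffu) + c1;
        rp[i] = (uint32_t)lo;
        c0 = (uint32_t)hi;
        /* Fits in 32 bits because the total carry stays below 2^64. */
        c1 = (uint32_t)((hi >> 32) + (p1 >> 32));
    }
    carry_out[0] = c0;
    carry_out[1] = c1;
}
```

One call to ref_addmul_2 gives the same result as two calls to
ref_addmul_1, with the second call offset by one limb.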
We've considered using karatsuba aka toom2 for addmul_2, but it has
always turned out that saving 1/4 of the multiply instructions is very
easily eaten up by the additional operations needed. But the other day,
it struck me that we might also try doing addmul_2 using toom32, which
would save 1/3 of the mul instructions. Toom32 is nice because we can
use the four easiest evaluation points: 0, infinity, and +/-1.
Or addmul_3 using toom32, which has the additional advantage that more
of the evaluation work is loop-invariant, and we could also jump to
separate innerloops depending on the carry bits from evaluation.
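As a toy illustration of the toom32 idea above (not GMP's mpn_toom32_mul
and not a proposed inner loop), the following C sketch multiplies a
3-limb polynomial by a 2-limb polynomial using four multiplies instead
of six, with the evaluation points 0, infinity, +1 and -1. The function
names are hypothetical, and 16-bit limbs are assumed only so that all
intermediate values fit in int64_t without any carry handling. It
produces the four unnormalized product coefficients; carry propagation
between them is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Schoolbook reference: 6 multiplies.
   (a0 + a1 x + a2 x^2)(b0 + b1 x) = c0 + c1 x + c2 x^2 + c3 x^3 */
static void schoolbook32(int64_t c[4], const uint16_t a[3],
                         const uint16_t b[2])
{
    c[0] = (int64_t)a[0] * b[0];
    c[1] = (int64_t)a[0] * b[1] + (int64_t)a[1] * b[0];
    c[2] = (int64_t)a[1] * b[1] + (int64_t)a[2] * b[0];
    c[3] = (int64_t)a[2] * b[1];
}

/* Toom-style 3x2: 4 multiplies, points 0, infinity, +1, -1. */
static void toom32(int64_t c[4], const uint16_t a[3], const uint16_t b[2])
{
    /* Evaluation. If b is the fixed multiplier of an addmul loop,
       b_p1 and b_m1 would be loop-invariant. */
    int64_t a_p1 = (int64_t)a[0] + a[1] + a[2];
    int64_t a_m1 = (int64_t)a[0] - a[1] + a[2];   /* may be negative */
    int64_t b_p1 = (int64_t)b[0] + b[1];
    int64_t b_m1 = (int64_t)b[0] - b[1];          /* may be negative */

    /* The four pointwise multiplies. */
    int64_t v0   = (int64_t)a[0] * b[0];          /* = c0 */
    int64_t vinf = (int64_t)a[2] * b[1];          /* = c3 */
    int64_t v1   = a_p1 * b_p1;                   /* c0 + c1 + c2 + c3 */
    int64_t vm1  = a_m1 * b_m1;                   /* c0 - c1 + c2 - c3 */

    /* Interpolation: both sums are even and non-negative. */
    assert((v1 + vm1) % 2 == 0 && (v1 - vm1) % 2 == 0);
    c[0] = v0;
    c[1] = (v1 - vm1) / 2 - vinf;
    c[2] = (v1 + vm1) / 2 - v0;
    c[3] = vinf;
}
```

With full-size limbs the sums a_p1 and b_p1 no longer fit in a limb, and
a_m1 and b_m1 need a sign; those are the carry bits from evaluation that
the separate inner loops would have to deal with.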
Perhaps this is still crazy, and useful only for machines with very slow
multiplication.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.