GMP «Arithmetic without limitations»
GMP developers' X86-64 corner



X86-64 core pipeline overview

Conroe
Penryn
Nehalem
Westmere
Sandy bridge Ivy bridge Haswell Broadwell Skylake Kaby lake
issue width 3 3 3 3 4 4 4 4
SIMD exec width 128 128 128 128 256 256 256 256

X86-64 optimisation background

GMP's performance on X86-64 chips is good. Up to the 5.1 release, the main optimisation effort was directed towards the AMD K8-K10 processors. Starting with GMP 6, the main effort shifted towards Intel CPUs. Since the release of AMD Zen, we optimise for both Intel and AMD CPUs.

Status

mul_basecase                    size  method                       plan
Intel Atom (Diamondville, etc)        generic
Intel Atom (Silvermont)               generic
Intel Conroe/Wolfdale           2687  SW4(m14/m24; loop(am24))
Intel Nehalem/Westmere                → Conroe/Wolfdale
Intel Sandy bridge (SBR)        951   SW(m14/m22) loop(SW am24)    rewrite to use SW4
Intel Ivy bridge (IBR)                → Sandy bridge
Intel Haswell (HWL)             1107  SW(m14/m24) loop(SW am24)    rewrite to use SW4
Intel Broadwell (BWL)           840   SW(m18) loop(SW am18)
Intel Skylake (SKY)                   → Broadwell
AMD K8-K10                      1099  SW(m14/m24) loop(JP am24)
AMD Bulldozer                   950   SW(m14/m22) loop(SW am24)
AMD Piledriver                        → Bulldozer
AMD Zen                         1396  SW4(m14) SW4(osploop(am14))
AMD Bobcat                      1263  m14 SW(m1-tail; loop(am14))
AMD Jaguar
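
For readers unfamiliar with the routine, here is a minimal Python sketch of what a basecase (schoolbook) multiplication computes: one mul_1 pass for the first limb of the second operand, then one addmul_1 pass per remaining limb. It assumes 64-bit limbs stored least significant first. Reading the table's `m…` and `am…` shorthand as mul_k and addmul_k inner loops is an assumption on our part, and the helper names below are illustrative stand-ins, not GMP's assembly or its C API.

```python
LIMB_BITS = 64                      # assumed limb size on x86-64
MASK = (1 << LIMB_BITS) - 1

def mul_1(rp, up, v):
    """Store up * v in rp[0:len(up)]; return the carry-out limb."""
    carry = 0
    for i, u in enumerate(up):
        t = u * v + carry
        rp[i] = t & MASK
        carry = t >> LIMB_BITS
    return carry

def addmul_1(rp, offset, up, v):
    """Add up * v into rp[offset:offset+len(up)]; return the carry-out limb."""
    carry = 0
    for i, u in enumerate(up):
        t = rp[offset + i] + u * v + carry
        rp[offset + i] = t & MASK
        carry = t >> LIMB_BITS
    return carry

def mul_basecase(up, vp):
    """Schoolbook product of two little-endian limb lists."""
    un, vn = len(up), len(vp)
    rp = [0] * (un + vn)
    rp[un] = mul_1(rp, up, vp[0])                 # first row
    for j in range(1, vn):                        # remaining rows
        rp[un + j] = addmul_1(rp, j, up, vp[j])
    return rp

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
```

The result can be checked against Python's built-in integers, e.g. `from_limbs(mul_basecase(to_limbs(a, 3), to_limbs(b, 2))) == a * b` for `a` below 2^192 and `b` below 2^128. The assembly variants in the table differ in how many limbs of the second operand each inner loop consumes and how the loops are unrolled, not in the result.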

sqr_basecase                    size  method                              plan
Intel Atom (Diamondville, etc)        generic
Intel Atom (Silvermont)               generic
Intel Conroe/Wolfdale           2761  SW4(m24 loop(am24→am24)) cor2x1
Intel Nehalem/Westmere                → Conroe/Wolfdale
Intel Sandy bridge (SBR)        1168  SW(m22) loop(SW(am24)) cor2x1       rewrite to use SW4
Intel Ivy bridge (IBR)                → Sandy bridge
Intel Haswell (HWL)             1304  SW(m22) loop(SW(am24)) cor2x1       rewrite to use SW4
Intel Broadwell (BWL)           2881  m18 SW(loop(am18→am18→am18→am18→am18→am18→am18→am18→)) alg:OTF cor3x2
Intel Skylake (SKY)                   → Broadwell
AMD K8-K10                      2189  SW(m14/m24) loop(am24→am24) cor2x1  rewrite w/o m1
AMD Bulldozer                         → K8                                written, but slowdown for important operand range
AMD Piledriver                        → K8                                make → Bulldozer
AMD Zen                         1472  SW(m14) SW4(loop (am14→am14→am14→am14→) alg:OTF
AMD Bobcat                      1492  m14 SW(m1-tail; loop(am14→am14→am14→am14)) cor2x1
AMD Jaguar
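
As a point of reference, a basecase squaring can avoid roughly half of the limb products by computing only the off-diagonal terms u[i]*u[j] with i < j, doubling them, and adding the diagonal squares. The Python sketch below shows that generic scheme under the assumption of 64-bit little-endian limbs. It does not attempt to model the `cor…` or `alg:OTF` entries in the table, whose exact meaning is not spelled out here, and all function names are illustrative.

```python
LIMB_BITS = 64                      # assumed limb size on x86-64
MASK = (1 << LIMB_BITS) - 1

def sqr_basecase(up):
    """Square a little-endian limb list using the off-diagonal triangle."""
    n = len(up)
    rp = [0] * (2 * n)

    # Off-diagonal triangle: sum of up[i] * up[j] * B^(i+j) for i < j.
    for i in range(n - 1):
        carry = 0
        for j in range(i + 1, n):
            t = rp[i + j] + up[i] * up[j] + carry
            rp[i + j] = t & MASK
            carry = t >> LIMB_BITS
        rp[i + n] = carry           # this limb has not been written before

    # Double the triangle and add the diagonal squares in one pass.
    carry = 0
    for i in range(n):
        sq = up[i] * up[i]
        lo = 2 * rp[2 * i] + (sq & MASK) + carry
        rp[2 * i] = lo & MASK
        hi = 2 * rp[2 * i + 1] + (sq >> LIMB_BITS) + (lo >> LIMB_BITS)
        rp[2 * i + 1] = hi & MASK
        carry = hi >> LIMB_BITS
    return rp

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
```

For any `a` below 2^192, `from_limbs(sqr_basecase(to_limbs(a, 3)))` should equal `a * a`.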

redc_1                          size  method                                               plan
Intel Atom (Diamondville, etc)  1394  SW4(loop(am14)) | pipelined q0 comp
Intel Atom (Silvermont)               generic
Intel Conroe/Wolfdale           1074  SW2(loop(am12)) | pipelined q0 comp
Intel Nehalem/Westmere          1602  SW4(loop(am14)) | pipelined q0 comp
Intel Sandy bridge (SBR)        1553  SW4(loop(am14)) | pipelined q0 comp
Intel Ivy bridge (IBR)                → Sandy bridge
Intel Haswell (HWL)             1187  SW4(loop(am14)) | pipelined q0 comp                  rewrite in new SBR code style
AMD K8-K10                      1593  SW4(loop(am14)) | pipelined q0 comp | inlined add_n  rewrite in new SBR code style
AMD Bulldozer                         → K8
AMD Piledriver                        → K8
AMD Zen                         827
AMD Bobcat                      1346  SW4(loop(am14)) | pipelined q0 comp
AMD Jaguar
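
To make the redc_1 rows easier to read, here is a minimal Python sketch of word-by-word Montgomery reduction with a one-limb quotient: each pass picks a quotient limb q that zeroes the current low limb, adds q*m with an addmul_1-style loop, and parks the carry in the freed limb; a final addition folds the parked carries into the high half. It assumes 64-bit little-endian limbs and an odd modulus. The sketch computes each q at the start of its pass; the "pipelined q0 comp" note in the table presumably means the assembly computes it earlier, which is not modelled. Names such as `neg_inverse` are hypothetical helpers, and the trailing conditional subtraction is done on Python integers for brevity.

```python
LIMB_BITS = 64                      # assumed limb size on x86-64
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def neg_inverse(m0):
    """Return -1/m0 mod 2^LIMB_BITS for odd m0 (needs Python 3.8+)."""
    return (-pow(m0, -1, 1 << LIMB_BITS)) & MASK

def redc_1(up, mp, invm):
    """Return U / B^n mod m as n limbs, for 2n-limb U < m * B^n."""
    n = len(mp)
    up = list(up)                       # work on a copy of the input
    for i in range(n):
        q = (up[i] * invm) & MASK       # quotient limb that zeroes up[i]
        carry = 0
        for j in range(n):              # addmul_1 on up[i:i+n]
            t = up[i + j] + q * mp[j] + carry
            up[i + j] = t & MASK
            carry = t >> LIMB_BITS
        up[i] = carry                   # up[i] is now 0; reuse it for the carry
    # Fold the parked carries into the high half, then reduce below m.
    m = from_limbs(mp)
    r = from_limbs(up[n:]) + from_limbs(up[:n])
    if r >= m:
        r -= m
    return to_limbs(r, n)
```

With `R = 2**(64 * n)`, the output should equal `U * pow(R, -1, m) % m` for any odd `m` of `n` limbs and any `U` below `m * R`.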

mullo_basecase                  size  method                         plan
Intel Atom (Diamondville, etc)        generic
Intel Atom (Silvermont)               generic
Intel Conroe/Wolfdale           996   SW(m24) loop(SW(am24)) cor2x1
Intel Nehalem/Westmere                → Conroe/Wolfdale
Intel Sandy bridge (SBR)        916   SW(m22) loop(SW(am24)) cor2x1
Intel Ivy bridge (IBR)                → Sandy bridge
Intel Haswell (HWL)             1049  SW(m24) loop(SW(am24)) cor2x1
AMD K8-K10                      1002  m1/m2 -> am2                   rewritten w/o m1, currently no speedup
AMD Bulldozer                         → K8
AMD Piledriver                        → K8
AMD Bobcat                            → K8
AMD Jaguar
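
For completeness, a minimal Python sketch of what a low-half basecase product computes: only the n least significant limbs of the product of two n-limb operands, so each successive row is one limb shorter and carries out of the top limb are dropped. As in the other sketches, 64-bit little-endian limbs are assumed and the function names are illustrative, not GMP's.

```python
LIMB_BITS = 64                      # assumed limb size on x86-64
MASK = (1 << LIMB_BITS) - 1

def mullo_basecase(up, vp):
    """Return the low len(up) limbs of up * vp (operands of equal length)."""
    n = len(up)
    rp = [0] * n
    for j in range(n):
        carry = 0
        for i in range(n - j):          # row j only reaches limb n-1
            t = rp[i + j] + up[i] * vp[j] + carry
            rp[i + j] = t & MASK
            carry = t >> LIMB_BITS
        # The carry out of limb n-1 belongs to the high half and is dropped.
    return rp

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
```

For three-limb operands, `from_limbs(mullo_basecase(to_limbs(a, 3), to_limbs(b, 3)))` should equal `(a * b) % 2**192`.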


Last modified: 2017-07-01