GMP on Intel Haswell
tg at gmplib.org
Thu Aug 1 11:53:59 CEST 2013
I got a new Intel Haswell system for the GMP test system array. This
CPU line is interesting to GMP because of its improvements in the area
of integer arithmetic.
The undisputed GMP champions have for years been the now-defunct AMD
K8 and K10 CPUs. Their most critical multiplication loops run at between
2.375 and 2.5 cycles per accumulated 64 x 64 -> 128 bit product.
No Intel system has come close, and newer AMD systems (Bulldozer,
Piledriver) run the loops at between 4.5 and 5.2 cycles per limb.
(New GMP code reaches 4.25 cycles.)
Haswell adds a new multiply instruction which avoids 2 of 3 fixed-
register operands. The old MUL did (rdx,rax) <- rax * regormem, while
the new MULX does (reg1,reg2) <- rdx * regormem. I suppose they kept
rdx fixed as a concession to the general x86 ugliness. :-)
Furthermore, MUL overwrites the carry flag with a useless value, while
MULX leaves flags alone.
The new instruction is much more suitable for GMP's needs.
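For readers who do not think in x86 assembly, the operation both
instructions compute can be modelled in portable C. This is only a sketch
of the arithmetic, not GMP code: the type product128 and the function
widening_mul are hypothetical names, and unsigned __int128 is a GCC/Clang
extension. The model says nothing about register assignment or flags,
which is exactly where MUL and MULX differ.

```c
#include <stdint.h>

/* Hypothetical helper type: the two 64-bit halves of a 128-bit product. */
typedef struct {
    uint64_t hi;   /* upper 64 bits, what MUL leaves in rdx */
    uint64_t lo;   /* lower 64 bits, what MUL leaves in rax */
} product128;

/* Full 64 x 64 -> 128 bit unsigned multiply.
   Assumes a compiler providing unsigned __int128 (GCC, Clang). */
static product128 widening_mul(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128) a * b;
    product128 r;
    r.hi = (uint64_t) (p >> 64);
    r.lo = (uint64_t) p;
    return r;
}
```

On a BMI2-capable compiler and CPU the same product is also reachable from
C through the _mulx_u64 intrinsic in <immintrin.h>, though whether a given
compiler then schedules it as well as hand-written assembly is a separate
question.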
I have written some preliminary loops using MULX, and optimised them for
Haswell. The results are encouraging; this CPU has the potential to
outperform all other x86 CPUs. The key multiply loops run at between
1.6 and 2.3 cycles/limb, resulting in about 20% higher performance than
on the old K10.
Thus far, only mul_1 (1.6 cycles/limb), and addmul_1/submul_1 (2.3
cycles/limb) are in the public repo.
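To make the cycles/limb figures concrete, here is a rough portable C
sketch of what mul_1 and addmul_1 compute per limb. It is a reference
model under stated assumptions, not the assembly in the repo: the names
limb, ref_mul_1 and ref_addmul_1 are made up for illustration, limbs are
assumed to be 64 bits, and unsigned __int128 is a GCC/Clang extension.
The argument order follows that of GMP's mpn_mul_1 and mpn_addmul_1.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;   /* hypothetical stand-in for a 64-bit mp_limb_t */

/* rp[0..n-1] = up[0..n-1] * v; returns the most significant (carry) limb. */
limb ref_mul_1(limb *rp, const limb *up, size_t n, limb v)
{
    limb carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One 64 x 64 -> 128 bit product plus the incoming carry limb.
           The sum cannot overflow 128 bits. */
        unsigned __int128 p = (unsigned __int128) up[i] * v + carry;
        rp[i] = (limb) p;
        carry = (limb) (p >> 64);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb ref_addmul_1(limb *rp, const limb *up, size_t n, limb v)
{
    limb carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Product plus the old result limb plus the carry limb; the
           maximum value is exactly 2^128 - 1, so this still fits. */
        unsigned __int128 p = (unsigned __int128) up[i] * v + rp[i] + carry;
        rp[i] = (limb) p;
        carry = (limb) (p >> 64);
    }
    return carry;
}
```

Each loop iteration corresponds to one limb in the timings above. The
extra addition of rp[i] in addmul_1 is a second carry chain running
alongside the product's own, which is consistent with addmul_1 being the
slower of the two.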
I have a 1.75 c/l mul_2 and 2.0 c/l addmul_2 in the assembly works. I
strongly suspect it is possible to do addmul at considerably less than
(A caveat about the new system: Perhaps I was unlucky, or perhaps the
platform is not yet robust, but the first system I got had a dead CPU,
and the second is not 100% stable under GNU/Linux; I get rare spurious
non-reproducible segfaults. None of FreeBSD, Debian, or Ubuntu would
work at all; they crashed in strange ways during install. Finally
Gentoo installed, but has the segfault problem.)