[PATCH] Improve System z support and add some tuning
tg at gmplib.org
Fri Oct 7 11:48:54 CEST 2011
Andreas Krebbel <krebbel at linux.vnet.ibm.com> writes:
As David already told you there are mainframe virtual machines available for developers:
I read the conditions, and they seem alright.
But a limit of 90 days makes this uninteresting for GMP. I intend to
support GMP for more than 90 more days. :-)
> (2) Write some more critical assembly routines, at least submul_1 and
> invert_limb. (I assume the latter will beat division, that in fact
> division instructions should never ever be used, just like on
> x86_64. But I don't know the quotient time(dlgr)/time(mlgr), which
> basically is what determines this.)
I can help with this after finishing the paperwork. This unfortunately
will take a while.
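
To illustrate why the time(dlgr)/time(mlgr) ratio matters, here is a minimal C sketch of division by a precomputed reciprocal. It is not the s390 code; it follows the well-known reciprocal-based 2-by-1 division scheme and assumes a compiler with unsigned __int128 (GCC or Clang). The names limb_t, dlimb_t, invert_limb_sketch and div_preinv_sketch are made up for this example. After the one-time inversion, each quotient limb costs two multiplications and a few adds, with no division instruction.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* Reciprocal of a normalized divisor d (top bit set):
   floor((B^2 - 1) / d) - B, with B = 2^64.  Computed once per divisor;
   this is the only place where a real division is performed. */
limb_t invert_limb_sketch(limb_t d)
{
    assert(d >> 63);                     /* d must be normalized */
    return (limb_t)(~(dlimb_t)0 / d);    /* truncating to 64 bits subtracts B */
}

/* Divide nh*B + nl by d using the reciprocal di.  Requires nh < d.
   Returns the quotient and stores the remainder in *rem. */
limb_t div_preinv_sketch(limb_t *rem, limb_t nh, limb_t nl, limb_t d, limb_t di)
{
    assert(nh < d);
    dlimb_t q = (dlimb_t)nh * di;              /* multiplication 1 */
    q += ((dlimb_t)(nh + 1) << 64) | nl;       /* wraps mod 2^128 by design */
    limb_t qh = (limb_t)(q >> 64);
    limb_t ql = (limb_t)q;
    limb_t r = nl - qh * d;                    /* multiplication 2, mod B */
    if (r > ql) { qh--; r += d; }              /* first adjustment */
    if (r >= d) { qh++; r -= d; }              /* rare second adjustment */
    *rem = r;
    return qh;
}
```

If a division instruction costs more than roughly two multiplications plus the adjustment steps, a loop built on this scheme should win for any divisor that is reused; whether that holds on a given machine has to be measured.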
The basic unoptimised set is now in place.
You're welcome to take a look to see if the insn mix seems good, and
perhaps run timing tests to see how far we are from the pipeline limits.
If we implement mul_basecase (which is needed for really good
performance) then for each limb product, we need lgr+mlgr+alcgr+alcgr or
mlg+alcgr+alcgr. We will also need preloading of some values and some
An estimate of how far the current code is from what is possible could
be made by analysing the above instructions' resource needs and comparing
them to the results of these measurements:
tune/speed -c -s1-50 mpn_mul.3 mpn_addmul.3
tune/speed -c -s1-50 mpn_mul_basecase
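
As a rough picture of the per-limb work discussed above, here is a C sketch of an addmul_1-style inner loop. It is not the assembly in the repository; it assumes a compiler with unsigned __int128 (GCC or Clang), and the names limb_t, dlimb_t and addmul_1_sketch are invented for the example. Each iteration does one 64x64->128 multiply and two additions that propagate a carry, which is roughly the work the instruction sequences above have to cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t addmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        dlimb_t p = (dlimb_t)up[i] * v;  /* one 64x64->128 limb product */
        p += rp[i];                      /* add the destination limb */
        p += carry;                      /* add the carry from the previous limb */
        rp[i] = (limb_t)p;               /* low half goes back to memory */
        carry = (limb_t)(p >> 64);       /* high half becomes the next carry */
    }
    return carry;
}
```

The 128-bit sum cannot overflow: (B-1)^2 + 2(B-1) = B^2 - 1 for B = 2^64. A mul_basecase would call such a loop once per limb of the second operand, so the cost of this loop body dominates.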
> (3) Improve inline assembly 32-bit support for processors with support
> for MLR/ALR/ALCR etc. I notice that you made an effort along these
> lines, but I am not sure it was done right. At least, the gcc on my
> Debian system does not define the predef your code relies on.
mlr/dlr are available since z900 in both esa and z/architecture mode. There isn't a macro
defined by GCC for each CPU level. However these can be used for the -m31 -mzarch mode in
longlong.h since there is the __zarch__ macro defined and zarch requires at least z900. So
with my patch the instructions are used for the s390x ABI=32 build.
You are right, in fact they could be used for s390 -march=z900 as well. But since s390 is
only rarely used anymore I think the ABI=32 s390x build should always be used for 32 bit
code so I would like to focus on this one regarding optimizations.
OK. We can also set predef macros from GMP's configure.
But I suppose we can ignore the obsolete machines, at least for now.
> (4) Should we perhaps use 64-bit limbs for the 31-bit ABI, when using a
> 64-bit processor? As far as I understand, this should work, and it
> would run much faster. (This would be akin to the N32 MIPS ABI and
> the HPPA 2.0N ABI.)
Yes I also thought about this but didn't implement it yet. GCC is
already (since 4.6.0) using 64 bit registers in 32 bit code when
compiling with -m31 -mzarch. This requires the kernel to save/restore
the upper 32 bits of the 64 bit registers when doing signal
handling. Here is a link to the GCC patch for the gory details:
So we have to consider 3 GMPABIs:
1. 31-bit addresses and 32-bit limbs (31/32)
2. 31-bit addresses and 64-bit limbs (31/64)
3. 64-bit addresses and 64-bit limbs (64/64)
Perhaps the latter two could share some/all assembly code, if we're
careful about which address update instructions we use?
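
A possible way to pick the limb type for the three ABIs from compiler predefines is sketched below. It relies on __zarch__ as discussed in this thread, plus GCC's __s390__ and __s390x__ predefines; the names limb_t, SKETCH_ABI and limb_bits_sketch are hypothetical, and this is not how GMP's configure makes the choice. The final two branches exist only so the sketch compiles on a non-s390 host.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#if defined(__s390x__)
  /* 64-bit addresses and 64-bit limbs (64/64). */
  typedef uint64_t limb_t;
  #define SKETCH_ABI "64/64"
#elif defined(__s390__) && defined(__zarch__)
  /* -m31 -mzarch: 31-bit addresses, 64-bit registers usable (31/64). */
  typedef uint64_t limb_t;
  #define SKETCH_ABI "31/64"
#elif defined(__s390__)
  /* Plain 31-bit code: 31-bit addresses and 32-bit limbs (31/32). */
  typedef uint32_t limb_t;
  #define SKETCH_ABI "31/32"
#elif UINTPTR_MAX > 0xffffffffu
  /* Not s390: fall back on pointer width so the sketch still builds. */
  typedef uint64_t limb_t;
  #define SKETCH_ABI "host/64"
#else
  typedef uint32_t limb_t;
  #define SKETCH_ABI "host/32"
#endif

#define SKETCH_LIMB_BITS ((int)(sizeof(limb_t) * CHAR_BIT))

/* Returns the limb width chosen at compile time. */
int limb_bits_sketch(void)
{
    assert(SKETCH_LIMB_BITS == 32 || SKETCH_LIMB_BITS == 64);
    return SKETCH_LIMB_BITS;
}
```

__s390x__ is tested first because a 64-bit compile also has __zarch__ defined, so the order of the branches matters.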
Is there hardware that has ML/MLR but not MLG/MLGR? I suppose such
machines then only support up to 31-bit addresses. (By "has" I mean
that it is implemented directly in hardware, not by a trap or in
We really want to use large limbs when possible, the performance gain is
close to 4x (for smallish operands) but is typically 3x.
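
The "close to 4x" figure matches a simple count of limb products in schoolbook multiplication, sketched below. This is only arithmetic on operand sizes, not a timing model, and the function name basecase_products is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of limb products in an n x n schoolbook multiplication of two
   operands of the given bit size, for a given limb width. */
uint64_t basecase_products(uint64_t bits, unsigned limb_bits)
{
    assert(limb_bits == 32 || limb_bits == 64);
    uint64_t n = (bits + limb_bits - 1) / limb_bits;  /* limbs per operand */
    return n * n;                                     /* one product per limb pair */
}
```

For 1024-bit operands this gives 1024 products with 32-bit limbs and 256 with 64-bit limbs, a factor of 4. Measured speedups can be lower, since a 64-bit multiply need not cost the same as a 32-bit one and loop overhead does not shrink by the same factor.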