[PATCH] Add optimized addmul_1 and submul_1 for IBM z13

Mon Mar 1 11:01:23 UTC 2021

My analysis and reports in this thread had several problems.  For
example, I had made shared-lib builds of some qemu images used for "user
mode" emulation; that does not work unless the host dynlibs are made
available in the guest file system.

Trying again: your submul_1 works fine on all tested versions of qemu,
which are 4.1.x through 5.2.0.  (But I did not test every version in
that range.)

The problems I saw with qemu and submul_1 were actually with the
existing s390 submul_1, which has been part of GMP for several years.
All tested qemu versions exhibit that same problem.  I have access to a
z196 system, and there I cannot provoke any error.

So, the old submul_1 code seems to work on actual hardware.  The code
looks good to me.  There is a bug in qemu.

I only tested "user mode" qemu for this experiment.  Sometimes full
system emulation runs instructions correctly when "user mode" emulation
does not.  Sometimes vice versa.

(I've encountered lots of bugs of this kind in qemu over the years.  I
have tried to help the qemu project as I have expertise in the areas of
CPUs, emulation and arithmetic, but I was discouraged enough by their
culture to no longer even try to take the time needed to report my
findings.)

I have dealt with qemu's bugginess by keeping many qemu versions
installed.  I then try to locate the most recent one which works for
running GMP in either full system emulation or "user mode" emulation.

  In the meantime, I've been working on a software pipelined variant of
  addmul_1, which improves significantly over my previous patch. That is C
  with inline assembly, which helps somewhat with the increased
  complexity. It needs more tuning and stress testing, though.

Great!

Usually, multiplication insn throughput is the limiting factor for
addmul_1 and friends.  Therefore, understanding its throughput is a
great place to start.  Once that is understood, one knows what to aim
for.

One usually can get quite close to the multiplication insn throughput in
addmul_1 (or in some cases addmul_2, or addmul_k for some small k > 2).
But usually, and in particular if that throughput is high, the end
performance will fall short of it by up to 50%.
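
For readers not familiar with the operation, here is a plain C sketch of
what addmul_1 computes.  This is not GMP's code; the name ref_addmul_1
is made up for illustration, and it assumes 64-bit limbs and a compiler
with unsigned __int128 (GCC, Clang).  It does show why the multiply
dominates: each limb costs exactly one 64x64->128 multiplication, and
everything else is additions and carry propagation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference, not GMP's implementation:
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb.
   Assumes 64-bit limbs and the unsigned __int128 extension. */
static uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
    uint64_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* One widening multiply per limb; the sum cannot overflow
           128 bits: (B-1)^2 + 2(B-1) = B^2 - 1 with B = 2^64. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (uint64_t)t;          /* low limb goes back to rp */
        cy = (uint64_t)(t >> 64);     /* high limb is the next carry */
    }
    return cy;
}
```

An assembly loop has to do the same work per limb, so the multiply
throughput gives a lower bound on cycles per limb.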

I only know of one CPU where addmul_1 runs at exactly the multiplication
insn throughput: Apple M1.

I looked briefly at your code after David sent a link to the s390 ISA
manual.  If I understand it correctly, you rely on 128 bit addition.  I
have no idea of the throughput or latency of those 128-bit instructions,
but if those numbers are good, I agree that they might be very useful.

An alternative is to stick to plain 64-bit (non-vector) instructions for
unrolled addmul_1.  We do that for many CPUs already.  One will need to
run through partial products twice, for s390 using alcg(r).  The most
significant 64-bit partial product of an unroll group can work as a
carry trap.
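
To make the two-pass idea concrete, here is a rough C sketch for an
unroll group of 4 limbs.  It is only my reading of the scheme, not code
from GMP or from the patch.  The names adc64 and sketch_addmul_1_x4 are
made up, adc64 merely stands in for an add-with-carry instruction such
as alcgr, and n is assumed to be a multiple of 4.  Pass 1 adds each low
product half to the high half of the previous product; pass 2 adds the
result to rp.  The carries out of both passes land in the top high half,
which cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a + b + *c, where *c is a carry bit (0 or 1), and stores the
   carry-out in *c.  Stand-in for an add-with-carry instruction. */
static inline uint64_t adc64(uint64_t a, uint64_t b, unsigned *c)
{
    uint64_t s = a + b;
    unsigned c1 = s < a;
    uint64_t t = s + *c;
    *c = c1 | (t < s);
    return t;
}

/* Hypothetical sketch: rp[0..n-1] += up[0..n-1] * v, 4 limbs per group.
   Assumes 64-bit limbs, unsigned __int128, and n a multiple of 4. */
static uint64_t sketch_addmul_1_x4(uint64_t *rp, const uint64_t *up,
                                   size_t n, uint64_t v)
{
    uint64_t trap = 0;                /* carry limb between groups */
    assert(n >= 4 && n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint64_t lo[4], hi[4], w[4];
        unsigned c;
        for (int k = 0; k < 4; k++) {
            unsigned __int128 p = (unsigned __int128)up[i + k] * v;
            lo[k] = (uint64_t)p;
            hi[k] = (uint64_t)(p >> 64);
        }
        /* Pass 1: combine the partial products into w[0..3] plus a top
           limb.  up[i..i+3]*v + trap < B^5, so hi[3] + c cannot wrap. */
        c = 0;
        w[0] = adc64(lo[0], trap, &c);
        w[1] = adc64(lo[1], hi[0], &c);
        w[2] = adc64(lo[2], hi[1], &c);
        w[3] = adc64(lo[3], hi[2], &c);
        trap = hi[3] + c;
        /* Pass 2: add to rp.  The grand total is at most B^5 - 1, so
           the top limb absorbs this carry too. */
        c = 0;
        for (int k = 0; k < 4; k++)
            rp[i + k] = adc64(rp[i + k], w[k], &c);
        trap += c;
    }
    return trap;
}
```

In real assembly the two carry chains of neighbouring groups would be
interleaved and the multiplications scheduled well ahead of their use;
the sketch only shows the data flow.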

If we pull out all the stops, perhaps addmul_4 or something like that,
combined with the vector instructions could yield the best performance.
In the end, balancing complexity and performance (and effort!) will
decide.

-- 
Torbjörn
Please encrypt, key id 0xC8601622