[PATCH] Add optimized addmul_1 and submul_1 for IBM z13

Sat Feb 20 16:59:01 UTC 2021

Marius Hillenbrand <mhillen at linux.ibm.com> writes:

  Done, completed successfully for both mpn_addmul_1 and mpn_submul1.

Good!

  I measure ~40% reduction in cycles per limb on z14 and ~60% reduction on
  z15, for both addmul_1 and submul_1 (smaller gains for small n <10 but
  no regressions), yet ...

Great speedup!

Smaller gains for small loop trip counts are expected.

  > 2. Is the measured performance close to what you would hope for, given
  > some hard pipeline limits?  If not, would you be willing to try to
  > improve things?

  ... I assumed that I hit max multiplication throughput, yet I just
  learned that there is potential to increase that further. I'll accept
  that challenge and keep working on the patch :)

Good!

  I went through the process before and already have copyright assignments
  in place for multiple GNU projects, including GMP, so we should be good
  from that side.

Ah, that will simplify things.

Your addmul_1 and submul_1 will be great contributins.  Please do a
mul_1 too, as that should be very straightforward by removing some insn
from addmul_1.

If you really want to make these CPUs perform near their maximum
caapbility, consider writing mul_basecase, sqr_basecase, and perhap
mpn_sbpi1_bdiv_r too.

The last one might not be too clear, but it is really very similar to
mul_basecase, only that one innerloop invariant multiplier comes from a
Hensel limb quotient.

A fast sqr_basecase is tricky, and GMP uses a plethora of algorithms for
it.  The most recent saves a lot of cycles and look like below in C.  In
asm, one would inline addmul_1, of course.  And for both mul_basecase
and in particular sqr_basecase, one should implement straight line code
for small operands (this also tends to simplify the main code as it
doesn't need to consider small operands).  For sqr_basecase, one would
exit the outer loop long before un becomes zero, and fall into straight
line code.

    void
    mpn_sqr_basecase (mp_ptr rp, mp_srcptr up, mp_size_t un)
    {
      mp_limb_t u0;
      mp_limb_t cin;

      u0 = up[0];
      umul_ppmm (cin, rp[0], u0, u0);
      ++rp;

      if (--un) {
        u0 = u0 << 1;
        up += 1;

        rp[un] = mpn_mul_1c (rp, up, un, u0, cin);

        // error: up[0]:63 x up[(n-1)...1], analogous after each iteration below
        do {
          u0 = up[0];
          mp_limb_t ci = -(up[-1] >> (GMP_NUMB_BITS-1)) & u0; // correction term
          mp_limb_t x0 = rp[1] + ci;
          mp_limb_t c0 = x0 < ci;
          mp_limb_t hi, lo;

          umul_ppmm (hi, lo, u0, u0);
          mp_limb_t x1 = x0 + lo;
          mp_limb_t c1 = x1 < lo;
          cin = hi + c0 + c1;
          ASSERT (cin < hi);
          rp[1] = x1;
          rp += 2;

          if (--un == 0) break;
          u0 = (up[-1] >> (GMP_NUMB_BITS-1)) + (u0 << 1);
          up += 1;

          rp[un] = mpn_addmul_1c (rp, up, un, u0, cin);
        } while (1);
      }

      rp[0] = cin;
    }

-- 
Torbjörn
Please encrypt, key id 0xC8601622