[PATCH] Add optimized addmul_1 and submul_1 for IBM z13

Fri Mar 5 11:52:09 UTC 2021

  >From 4x unrolling to 8x unrolling with pipelinging we gain ~40%.

Wow, that's much more than I would have expected (or actually seen at my
end).

Which brings up another question: Which zarch pipelines does it make
sense to optimise for?  I thought I had z196 access, but that is
probably not true as I didcovered that it supports the vector insns.  It
must be a z13 or later, I think.

I don't mind adding code for older CPUs at all.  I want to avoid slowing
down GMP for any CPU, old or new, so adding code which affects a CPU
should not be done without benchmarking on that CPU.

However, on that machine, a 4x variant of my code runs addmul_1 at about
3.4 cycles/limb and an 8x variant runs at about 3.3 cycles/limb.  That
difference is far smaller than what you see, either because we are on
different CPUs or because your 8x variant is far better than mine.

I tested a mlgr-only loop without any dependencies.  It indicates an
mlgr throughput of 0.5 insn/cycle.  Thus, we cannot hope for anymul_1 to
run at less than 2 cycles/limb.

If your cycles/limb numbers are not better than mine, there is still
considerable room for improvements, 3.3 is still far from 2.0!

I haven't thought carefully about it, but an addmul_2 with vector
accumulation might help.  An addmul_2 loop use half the number of loads
and stores, and use fewer vpdi, compared to addmul_1.  And perhaps one
could accumulate cleverly to avoid some of the carries.  Or not.

By the way, have you tried using vlerg/vsterg to avoid vpdi?

(I am having problems understanding which instructions are implemented
in a particular CPU.  The "Principles of Operations" manual seem to skip
that aspect.  I tried vlerg but it was not even recognised by the
assembler.)

  > Be careful about the lead-in times too.  With deep unrolling, one needs
  > a table indexed by n % W, where W is the unrolling arity.

  Since I found code complexity (and debugging effort, to be honest) raise
  sharply when supporting flexible lead-ins, I am currently wrapping up a
  hybrid variant for addmul_1 and mul_1: use the much simpler 2x unrolled
  variant from my first patch when N is less than the pipelined lead-in +
  one iteration + lead-out; when N is above the threshold, use that loop
  for the n%W and then use the pipelined variant. As a plus, we need fewer
  registers below the threshold. That variant achieves ~2x speedup for N <
  18 and gets to within 20% of peak multiplication throughput above that.

Interesting!

A mistake can be made here, and that is optimising addmul_k for too
large limb counts.  We've made that mistake a lot with GMP in the past.

So why is that a mistake?  Because the use of divide-and-conquer
algorithms make large trip counts exceedingly rare.

I tested on the z13(?) and found that Karatsuba's algorithm is used
already from 8 limbs for multiplication and 11 limbs for squaring.
That's with the 4x vector accumulation addmul_1.

I expect mul_basecase/sqr_basecase would increase these numbers a but,
as they cut O(n) of the execution time.

I would still be surprised if 8x addmul_1 unrolling help in practice,
but I would love to be proven wrong.

For completeness of my reasoning, there is one rare case where addmul
trip counts will be large: a very unbalaned multiplication where one
operand is >> than the other.

To make mul_basecase as low-overhead as possible, one need to do the un
mod k (k being the unrolling factor) before the k outer loops.  Yes,
outer loops, plural.  The lead-in of the k inner loops then can be
condition-less straight-line code.  Yes, major code expansion, in
particular for large k, but mul_basecase and sqr_basecase are super
important.

The structure of sqr_basecase will become different, as the inner loop
trip counts will decrease by one in the outer loop.  Still, one should
do not n mod k except in the function header code.  Instead, one need to
branch into k different places inside the one outer loop.  Tricky, yes.
Neat, not at all.

I am all for fat support for zaxrh.

-- 
Torbjörn
Please encrypt, key id 0xC8601622