[PATCH] Add optimized addmul_1 and submul_1 for IBM z13
tg at gmplib.org
Fri Mar 5 11:52:09 UTC 2021
> From 4x unrolling to 8x unrolling with pipelining we gain ~40%.
Wow, that's much more than I would have expected (or actually seen at my end).
Which brings up another question: which zarch pipelines does it make
sense to optimise for? I thought I had z196 access, but that is
probably not true, as I discovered that the machine supports the vector
insns. It must be a z13 or later, I think.
I don't mind adding code for older CPUs at all. I want to avoid slowing
down GMP for any CPU, old or new, so adding code which affects a CPU
should not be done without benchmarking on that CPU.
However, on that machine, a 4x variant of my code runs addmul_1 at about
3.4 cycles/limb and an 8x variant runs at about 3.3 cycles/limb. That
difference is far smaller than what you see, either because we are on
different CPUs or because your 8x variant is far better than mine.
I tested a mlgr-only loop without any dependencies. It indicates an
mlgr throughput of 0.5 insn/cycle. Thus, we cannot hope for anymul_1 to
run at less than 2 cycles/limb.
If your cycles/limb numbers are not better than mine, there is still
considerable room for improvement: 3.3 is still far from 2.0!
I haven't thought carefully about it, but an addmul_2 with vector
accumulation might help. An addmul_2 loop uses half the number of loads
and stores, and fewer vpdi, compared to addmul_1. And perhaps one
could accumulate cleverly to avoid some of the carries. Or not.
By the way, have you tried using vlerg/vsterg to avoid vpdi?
(I am having problems understanding which instructions are implemented
in a particular CPU. The "Principles of Operation" manual seems to skip
that aspect. I tried vlerg but it was not even recognised by the
assembler.)
> Be careful about the lead-in times too. With deep unrolling, one needs
> a table indexed by n % W, where W is the unrolling arity.
Since I found that code complexity (and debugging effort, to be honest)
rises sharply when supporting flexible lead-ins, I am currently wrapping
up a hybrid variant for addmul_1 and mul_1: use the much simpler 2x
unrolled variant from my first patch when N is less than the pipelined
lead-in + one iteration + lead-out; when N is above the threshold, use
that loop for the n % W limbs and then use the pipelined variant. As a
plus, we need fewer registers below the threshold. That variant achieves
~2x speedup for N < 18 and gets to within 20% of peak multiplication
throughput above that.
A mistake can be made here, and that is optimising addmul_k for too
large limb counts. We've made that mistake a lot with GMP in the past.
So why is that a mistake? Because the use of divide-and-conquer
algorithms makes large trip counts exceedingly rare.
I tested on the z13(?) and found that Karatsuba's algorithm is used
already from 8 limbs for multiplication and 11 limbs for squaring.
That's with the 4x vector accumulation addmul_1.
I expect mul_basecase/sqr_basecase would increase these numbers a bit,
as they cut O(n) from the execution time.
I would still be surprised if 8x addmul_1 unrolling helps in practice,
but I would love to be proven wrong.
For completeness of my reasoning, there is one rare case where addmul
trip counts will be large: a very unbalanced multiplication where one
operand is much larger than the other.
To make mul_basecase as low-overhead as possible, one needs to do the un
mod k (k being the unrolling factor) before the k outer loops. Yes,
outer loops, plural. The lead-in of the k inner loops can then be
condition-less straight-line code. Yes, major code expansion, in
particular for large k, but mul_basecase and sqr_basecase are super
important.
The structure of sqr_basecase will become different, as the inner loop
trip counts decrease by one per outer loop iteration. Still, one should
not do n mod k except in the function header code. Instead, one needs to
branch into k different places inside the one outer loop. Tricky, yes.
Neat, not at all.
I am all for fat support for zarch.
Please encrypt, key id 0xC8601622