[PATCH] Add optimized addmul_1 and submul_1 for IBM z13

Fri Mar 5 10:35:57 UTC 2021

On 3/2/21 1:55 PM, Torbjörn Granlund wrote:
> Marius Hillenbrand <mhillen at linux.ibm.com> writes:
> 
>   Most notably, I changed the order so that the mlgr's are next to each
>   other. The reason is that decode and dispatch happens in two "groups" of
>   up to three instructions each, with each group going into one of the two
>   "issue sides" of the core (both are symmetric and have  the same set of
>   issue ports since z13). For some instructions, grouping is restricted --
>   that includes mlgr, which will be alone in a group. Thus, placing two
>   mlgr's next to each other will ensure that they will spread across both
>   issue sides and exploit both multiply units.
> 
> This sort of restrictions are hard to deal with.
> 
> What happens around branches with these grouping restrictions?  Could an
> incoming branch (such as the loop branch) have one group from before the
> branch and one after the branch, thus invalidating the assumption of
> adjacent mlgr's going to different groups?

Branches predicted taken will end a group, so when we branch back in the
loop, grouping will be the same in each iteration. The first iteration
can differ when we do not branch into the loop; then, the instructions
before the loop can result in different group boundaries.

Though, the restrictions on mlgr are independent of that: each mlgr will
always be alone in a group, no matter what happens before.

> I've seen cases (with other pipelines) where a loop can take completely
> different times depending on some parity-like issue condition upon loop
> entry.  It never recovers in the "bad" case.

I am not aware that we could run into a condition like that. Both the
mlgrs and the taken branch while looping will reset grouping to a
defined and reproducible state.

>   In my experiments, incrementing and comparing a single "idx" turned out
>   beneficial over incrementing the pointers and decrementing n separately.
> 
> Doesn't brctg with its awareness of the induction variable help branch
> prediction in such a way that not only is branch back accurately
> predicted, but also the final fall-through?

Indeed, brctg will likely improve prediction accuracy for few
iterations. Not sure that it makes a difference for many iterations, though.

> OK, using brctg and whether to use idx is perhaps orthogonal.

Yes. I probably saw a small benefit of incrementing only the single
index, instead of all pointers.

Thanks for sharing your unrolled versions and the variant that jumps
into the loop. That's very interesting.

>   To improve performance notably (~40% over my initial patch on z15), my
>   currently best-performing implementation maintains the same instruction
>   sequence (mlgr + vacq for two limbs at a time) as our previous attempts,
>   yet unrolls for 8 limbs per iteration with software pipelining of the
>   different stages (load, multiply, add, and so on).
> 
>   Unrolling even more did not improve performance.
> 
> Did you get rid of the lgr of the carry limb?  That should not be too
> hard.  The code attached does that.

Yes. The 'vlvgp' are placed so that they can collect the carries without
the need of an lgr. That comes at the cost of more registers, yet I need
them anyway to spread loads / muls / vlvgp sufficiently.

> What is the performance improvement for going from 4x to 8x unrolling?

>From 4x unrolling to 8x unrolling with pipelinging we gain ~40%.

> Be careful about the lead-in times too.  With deep unrolling, one needs
> a table indexed by n % W, where W is the unrolling arity.

Since I found code complexity (and debugging effort, to be honest) raise
sharply when supporting flexible lead-ins, I am currently wrapping up a
hybrid variant for addmul_1 and mul_1: use the much simpler 2x unrolled
variant from my first patch when N is less than the pipelined lead-in +
one iteration + lead-out; when N is above the threshold, use that loop
for the n%W and then use the pipelined variant. As a plus, we need fewer
registers below the threshold. That variant achieves ~2x speedup for N <
18 and gets to within 20% of peak multiplication throughput above that.

Next, I plan to integrate with mul_basecase and sqr_basecase so that I
can get an indication of higher-level  speedup from gmp-bench. Also,
supporting "fat binaries" on s390x looks worthwhile. Then, I would
iterate back to making the lead-in more flexible and get rid of the 2x
unrolled loop for small N.

Marius
-- 
Marius Hillenbrand
Linux on Z development
IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht
Stuttgart, HRB 243294