[PATCH] Add optimized addmul_1 and submul_1 for IBM z13

Marius Hillenbrand mhillen at linux.ibm.com
Tue Mar 2 10:39:11 UTC 2021


Hi,

On 3/2/21 12:30 AM, Torbjörn Granlund wrote:
> Torbjörn Granlund <tg at gmplib.org> writes:
> 
>   I played a bit with an addmul_1 of my own, with some ideas from your
>   code.  I don't plan to do more work on this.  Does this perform well on
>   hardware?
> 
> I now realise that the instruction sequence of my example is essentially
> the same as in your code, except with more unrolling.  That's a bit
> surprising, as I wrote it by understanding the instruction set, starting
>   at the vac* insn which your code made me look at.  I did actually not
> look at your code while writing my code.

Indeed, it is intriguing that we were both led to the same instruction
selection. That basic sequence has also held up in my current
best-performing variant. Let me answer your questions and comments one
by one.

I have run & benchmarked your code, with a few changes. It performs
significantly better than my initial submission (because of the
unrolling) and close to my second-best variant (same unrolling,
different schedule).

Most notably, I changed the order so that the mlgr's are next to each
other. The reason is that decode and dispatch happen in two "groups" of
up to three instructions each, with each group going into one of the two
"issue sides" of the core (since z13, both sides are symmetric and have
the same set of issue ports). For some instructions, grouping is
restricted -- that includes mlgr, which is always alone in its group.
Thus, placing two mlgr's next to each other ensures that they spread
across both issue sides and exploit both multiply units.

(There is an "IBM Z / LinuxONE System Processor Optimization Primer"
publicly available from
https://community.ibm.com/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=d1cdb394-0159-464c-92a3-3f74f8c545c4
that discusses these details.)

Your code with the adjusted grouping:

L(top): lg      %r1, 0(%r13, %r3)
        lg      %r7, 8(%r13, %r3)
        lg      %r9, 16(%r13, %r3)

        mlgr    %r0, %r5
        mlgr    %r6, %r5

        lg      %r11, 24(%r13, %r3)
        mlgr    %r8, %r5
        mlgr    %r10, %r5

> (your comment) Remove "idx" and instead decrement n, update up, and rp.

In my experiments, incrementing and comparing a single "idx" turned out
to be beneficial compared with bumping the pointers and decrementing n
separately.
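
For illustration, a plain-C sketch of the two loop-control styles (the
function names are mine and the loop bodies are elided; this is not the
submitted kernel):

#include <stddef.h>

/* single running index: one register update plus one compare per iteration */
void index_style (unsigned long *rp, const unsigned long *up, size_t n)
{
  for (size_t idx = 0; idx < n; idx += 2)
    {
      /* ... two-limb body operating on rp[idx] and up[idx] ... */
    }
}

/* separate pointer bumps plus a count: three updates per iteration */
void pointer_style (unsigned long *rp, const unsigned long *up, size_t n)
{
  for (; n >= 2; rp += 2, up += 2, n -= 2)
    {
      /* ... two-limb body operating on rp[0] and up[0] ... */
    }
}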

Similarly, using 128-bit adds in vector registers performs better than
alcgr + algr. One factor is that alcgr must be alone in a dispatch
group, just like mlgr. Given the number of alcgrs we would need, the
128-bit adds win. For comparison, vacq and vacccq also have a grouping
limitation -- only two of them fit in a group. However, that still means
we can fit a 128-bit add with both carry-in and carry-out into one
dispatch group, instead of just a 64-bit add.
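
As a rough sketch of that pattern (assuming <vecintrin.h> and something
like -march=z13 -mzvector; the helper name is mine, not the patch's):
vec_adde_u128 maps to vacq and vec_addec_u128 to vacccq, so one such
pair adds a full 128 bits with carry-in and carry-out.

#include <vecintrin.h>

typedef unsigned char uv128 __attribute__ ((vector_size (16)));

/* 128-bit add with carry-in and carry-out, i.e. one vacq + one vacccq */
static inline uv128
add128_carry (uv128 a, uv128 b, uv128 *carry)
{
  uv128 sum = vec_adde_u128 (a, b, *carry);   /* vacq:   a + b + carry-in   */
  *carry = vec_addec_u128 (a, b, *carry);     /* vacccq: carry-out of above */
  return sum;
}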

My currently best-performing implementation improves notably on my
initial patch (~40% on z15). It keeps the same instruction sequence
(mlgr + vacq for two limbs at a time) as our previous attempts, but
unrolls to 8 limbs per iteration and software-pipelines the different
stages (load, multiply, add, and so on).

Unrolling even more did not improve performance.

To cope with the complexity, the code is in C, with a mix of GCC vector
extensions, Z vector builtins, and inline assembly (to enforce the mlgr
scheduling and to trick GCC in some cases):

/* ... rampup ... */
for ( ) { /* four limb pairs 0 .. 3 in flight */
        /* first limb pair */
        LOAD_LP1(2 + 8);  /* loading next limb pair */
        MULT_STAGE_0;
        SECOND_ADD_STAGE(1); /* load and mult in rampup, now add */
        FIRST_ADD_STAGE(2);
        WRITEBACK_STAGE(0, 0);
        TRANSFER_TO_VR_0;
        asm ("");

        /* second limb pair */
        LOAD_LP2(4 + 8);
        MULT_STAGE_1;
        SECOND_ADD_STAGE(2);
        FIRST_ADD_STAGE(3);
        WRITEBACK_STAGE(1, 2);
        TRANSFER_TO_VR_1;
        asm ("");

        /* third and fourth limb pair ... */
}
/* wrapup ... */

While this variant helped a lot in debugging and in tweaking parameters
and the schedule, it is hackish and brittle (e.g., the empty asm("")s
help define the instruction scheduling, yet GCC may change how it
handles them over time). Further, I suspect there are performance gains
left to be had by hand-tweaking the assembly code.
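
For reference, the inline assembly that pins the multiplies to mlgr is
in the spirit of the following sketch (the helper name and the exact
constraints are my illustration, not the patch). mlgr wants the
multiplicand in the odd half of an even/odd register pair and returns
the 128-bit product in that pair, which is why the operand is tied to
an unsigned __int128:

static inline void
mul_limb (unsigned long *hi, unsigned long *lo,
          unsigned long u, unsigned long v)
{
  unsigned __int128 pair;
  /* zero-extending u into the pair puts u in the odd register; mlgr then
     leaves the high product limb in the even register, the low in the odd */
  __asm__ ("mlgr\t%0,%2"
           : "=r" (pair)
           : "0" ((unsigned __int128) u), "r" (v));
  *hi = (unsigned long) (pair >> 64);
  *lo = (unsigned long) pair;
}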

So, for integrating this implementation into GMP, I propose adding the
resulting assembly variant and keeping that C code alongside it, for
reference and future improvements.

What do you think?

Marius
-- 
Marius Hillenbrand
Linux on Z development
IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht
Stuttgart, HRB 243294

