[PATCH] Add addmul_1, addmul_2, and mul_basecase for IBM z13 and later

Thu Jul 22 14:04:02 UTC 2021

Hi,

Based on your feedback on my previous patches, I rewrote addmul_1/mul_1 and
added implementations for addmul_2/mul_2 and mul_basecase. They are still based
on multiplying 64x64->128 in gpr pairs and accumulating 128-bit-wise in vector
registers.
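
For readers who want the arithmetic spelled out, here is a minimal portable
sketch of what addmul_1 computes. It is not the code from the patch: it uses the
GCC/Clang `unsigned __int128` extension in place of GPR pairs and vector
registers, and the names `limb_t`, `dlimb_t`, and `ref_addmul_1` are invented
for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;           /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t; /* GCC/Clang extension, assumed available */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t
ref_addmul_1 (limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
  limb_t carry = 0;

  assert (n > 0);
  for (size_t i = 0; i < n; i++)
    {
      /* 64x64->128 product plus two limbs. The sum is at most
         (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1, so it cannot overflow.  */
      dlimb_t acc = (dlimb_t) up[i] * v + rp[i] + carry;

      rp[i] = (limb_t) acc;          /* low half is the result limb */
      carry = (limb_t) (acc >> 64);  /* high half feeds the next limb */
    }
  return carry;
}
```

The bound in the comment is why a product and two limbs can be accumulated in a
single 128-bit quantity without tracking a carry-out.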

The code passes "make check", of course, and I have run "try" for ~72 hours for
each of the functions (on top of countless iterations of the relevant individual
test cases in tests/devel).

GMPbench.base.multiply improves by about 50% on z15, and the overall GMPbench
score improves by ~35%. The patches do not yet include new tuneup parameters.

All the implementations are in C with enough inline assembly to result in decent
code. mul_basecase #includes and inlines the (add)mul functions to avoid calls
and unnecessary branches.

All the (add)mul_1/2 functions are 4x unrolled over the first operand (i.e., 4
mults per iteration in addmul_1, 8 mults in addmul_2). mul_basecase is
structured so that it branches on (un % 4) only once on entry to select the
correct loop prologue, so the inlined addmul bodies do not have to repeat that
branch.
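
The unrolling scheme can be sketched in portable C roughly as follows. This is
an illustration under assumptions, not the patch's code: it shows a standalone
addmul_1 that picks its prologue from n % 4 once and then runs a 4x unrolled
loop, whereas the patch hoists that choice into mul_basecase. The names
`mac_step` and `unrolled_addmul_1` are invented here, and `unsigned __int128`
is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;           /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t; /* GCC/Clang extension, assumed available */

/* One multiply-accumulate step: *r += u * v + carry; returns the new carry. */
static inline limb_t
mac_step (limb_t *r, limb_t u, limb_t v, limb_t carry)
{
  dlimb_t acc = (dlimb_t) u * v + *r + carry;

  *r = (limb_t) acc;
  return (limb_t) (acc >> 64);
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t
unrolled_addmul_1 (limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
  limb_t carry = 0;
  size_t i = n % 4;

  assert (n > 0);

  /* Prologue, selected once on entry: consume the n % 4 leading limbs. */
  switch (i)
    {
    case 3:
      carry = mac_step (&rp[0], up[0], v, carry);
      carry = mac_step (&rp[1], up[1], v, carry);
      carry = mac_step (&rp[2], up[2], v, carry);
      break;
    case 2:
      carry = mac_step (&rp[0], up[0], v, carry);
      carry = mac_step (&rp[1], up[1], v, carry);
      break;
    case 1:
      carry = mac_step (&rp[0], up[0], v, carry);
      break;
    default:
      break;
    }

  /* Main loop: the remaining limb count is a multiple of 4. */
  for (; i < n; i += 4)
    {
      carry = mac_step (&rp[i], up[i], v, carry);
      carry = mac_step (&rp[i + 1], up[i + 1], v, carry);
      carry = mac_step (&rp[i + 2], up[i + 2], v, carry);
      carry = mac_step (&rp[i + 3], up[i + 3], v, carry);
    }
  return carry;
}
```

In a mul_basecase built this way, un stays the same for every row, so the
prologue choice made on entry holds for all subsequent addmul passes.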

The accumulation structure in addmul_2 may be a little unexpected. The idea
there is to prefer 128-bit adds without carry to full adds with carry-in and
carry-out wherever possible, because the latter require two instructions for
each sum and have instruction grouping limitations. The resulting code performs
better than strictly using adds with carry-in/out for the moderate number of
limbs that are relevant for mul_basecase.
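
To illustrate why carry-free 128-bit adds can suffice, here is a portable sketch
of one possible addmul_2 formulation. It is an assumption-laden illustration
and not the accumulation structure used in the patch: each 128-bit sum is one
64x64 product plus two limbs, which is at most 2^128 - 1 and so never produces
a carry-out. The name `sketch_addmul_2` and its calling convention (rp[n] is
overwritten and the top limb is returned) are chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;           /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t; /* GCC/Clang extension, assumed available */

/* Adds up[0..n-1] * {vp[0], vp[1]} to rp[0..n-1], stores limb n of the
   result in rp[n] (overwritten, not read), and returns limb n+1.
   rp and up must not overlap.  */
static limb_t
sketch_addmul_2 (limb_t *rp, const limb_t *up, size_t n, const limb_t *vp)
{
  limb_t v0 = vp[0], v1 = vp[1];
  limb_t c0 = 0;     /* carry limb of the v0 row */
  limb_t c1 = 0;     /* carry limb of the v1 row */
  limb_t u_prev = 0; /* up[i-1]; the v1 row runs one limb behind */
  dlimb_t t;

  assert (n > 0);
  for (size_t i = 0; i < n; i++)
    {
      /* Each sum is product + limb + limb <= 2^128 - 1: no carry-out. */
      dlimb_t t1 = (dlimb_t) u_prev * v1 + rp[i] + c1;
      dlimb_t t0 = (dlimb_t) up[i] * v0 + (limb_t) t1 + c0;

      rp[i] = (limb_t) t0;
      c1 = (limb_t) (t1 >> 64);
      c0 = (limb_t) (t0 >> 64);
      u_prev = up[i];
    }

  /* Tail: the last v1 product plus both pending carry limbs. */
  t = (dlimb_t) u_prev * v1 + c1 + c0;
  rp[n] = (limb_t) t;
  return (limb_t) (t >> 64);
}
```

This sketch trades the explicit carry bit for two carry limbs, one per row of
the multiplication; whether that matches the instruction-level trade-off on
z13 and later is not something the sketch can show.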

Regards,
Marius