[PATCH v2] Add addmul_1, addmul_2, and mul_basecase for IBM z13 and later

Thu Aug 5 07:03:50 UTC 2021

Hi,

Changes from v1:
  - add tuneup results from z13
  - fix mul_basecase to use #included and inlined addmul_2

Based on your feedback on my previous patches, I rewrote addmul_1/mul_1
and added implementations for addmul_2/mul_2 and mul_basecase. They are
still based on multiplying 64x64->128 in gpr pairs and accumulating
128-bit-wise in vector registers.
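
For readers who want the arithmetic spelled out, here is a minimal
portable C sketch of the 64x64->128 multiply-and-accumulate step. It is
not the code in the patch: it uses the GCC/Clang unsigned __int128
extension instead of gpr pairs and vector registers, and the names
limb_t, u128 and ref_addmul_1 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;         /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n >= 1);
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 64x64->128 product plus two 64-bit addends cannot overflow
           128 bits: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        u128 acc = (u128)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)acc;          /* low half is the result limb */
        carry = (limb_t)(acc >> 64);  /* high half feeds the next limb */
    }
    return carry;
}
```

The overflow bound in the comment is what allows the product and both
addends to be summed in a single 128-bit accumulator.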

The code passes "make check", of course, and I have run "try" for ~72
hours for each of the functions (on top of countless iterations of the
relevant individual test cases in tests/devel).

GMPbench.base.multiply improves by about 50% on z15, and the overall
score in GMPbench improves by ~35%. The patches do not include new
tuneup parameters yet.

All the implementations are in C with enough inline assembly to result
in decent code. mul_basecase #includes and inlines the (add)mul
functions to avoid calls and unnecessary branches.

All the (add)mul_1/2 functions are 4x unrolled over the first operand
(i.e., 4 multiplications per iteration in addmul_1, 8 in addmul_2).
mul_basecase branches on (un % 4) only once, on entry, to select the
correct loop prologue, so the inlined addmul bodies do not need to
repeat that branch.
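
The dispatch idea can be sketched in portable C as follows. This is only
an illustration under assumptions, not the structure of the patch: it
uses unsigned __int128 instead of inline assembly, builds the product
from addmul_1 rows only, and all names (limb_t, MAC_STEP, sketch_*) are
hypothetical. The head count is passed as a constant from each switch
case so that a compiler which inlines the helpers can specialize the
prologue per residue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;         /* hypothetical 64-bit limb type */
typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* One multiply-accumulate step: rp[i] += up[i] * v + carry. */
#define MAC_STEP(i)                                       \
    do {                                                  \
        u128 acc = (u128)up[(i)] * v + rp[(i)] + carry;   \
        rp[(i)] = (limb_t)acc;                            \
        carry = (limb_t)(acc >> 64);                      \
    } while (0)

/* One row: rp[0..un-1] += up[0..un-1] * v, with head == un % 4. */
static inline limb_t sketch_addmul_1_row(limb_t *rp, const limb_t *up,
                                         size_t un, limb_t v, size_t head)
{
    limb_t carry = 0;
    size_t i = 0;
    for (; i < head; i++)        /* prologue: 0..3 leftover limbs */
        MAC_STEP(i);
    for (; i < un; i += 4) {     /* main loop: 4 multiplies per iteration */
        MAC_STEP(i);
        MAC_STEP(i + 1);
        MAC_STEP(i + 2);
        MAC_STEP(i + 3);
    }
    return carry;
}

/* All rows for one fixed prologue length. */
static inline void sketch_rows(limb_t *rp, const limb_t *up, size_t un,
                               const limb_t *vp, size_t vn, size_t head)
{
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = sketch_addmul_1_row(rp + j, up, un, vp[j], head);
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1]; rp must not overlap inputs. */
static void sketch_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                                const limb_t *vp, size_t vn)
{
    assert(un >= 1 && vn >= 1);
    for (size_t i = 0; i < un; i++)
        rp[i] = 0;
    switch (un % 4) {            /* the only branch on un % 4 */
    case 0: sketch_rows(rp, up, un, vp, vn, 0); break;
    case 1: sketch_rows(rp, up, un, vp, vn, 1); break;
    case 2: sketch_rows(rp, up, un, vp, vn, 2); break;
    default: sketch_rows(rp, up, un, vp, vn, 3); break;
    }
}
```

Whether the compiler really removes the prologue test from each row
depends on inlining decisions, which is one reason to write the real
thing with explicit per-residue code paths.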

The accumulation structure in addmul_2 may be a little unexpected. The
idea is to prefer 128-bit adds without carry to full adds with carry-in
and carry-out wherever possible, because the latter require two
instructions per sum and have instruction grouping limitations. For the
moderate number of limbs relevant to mul_basecase, the resulting code
performs better than one that strictly uses adds with carry-in/out.
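
To illustrate why carry-less 128-bit adds can suffice, here is a rough
portable C sketch of one way to accumulate addmul_2 column by column. It
is an assumption-laden illustration, not the accumulation scheme in the
patch: the interface of sketch_addmul_2 (two multiplier limbs passed by
value, n+1 limbs stored, top limb returned) and all names are
hypothetical, and unsigned __int128 stands in for a vector register.
Each column sums at most six 64-bit terms, so the 128-bit total cannot
overflow and no add needs a carry-out; only the high half is passed on
to the next column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;         /* hypothetical 64-bit limb type */
typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* Adds up[0..n-1] * (v0 + v1 * 2^64) to rp[0..n-1], stores the low n+1
   limbs of the sum in rp[0..n], and returns the most significant limb. */
static limb_t sketch_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                              limb_t v0, limb_t v1)
{
    assert(n >= 1);
    u128 p0_prev = 0;         /* up[i-1] * v0 */
    u128 p1_prev = 0;         /* up[i-1] * v1 */
    limb_t p1_hi_prev2 = 0;   /* high half of up[i-2] * v1 */
    limb_t carry = 0;         /* high half of the previous column sum */

    for (size_t i = 0; i < n; i++) {
        u128 p0 = (u128)up[i] * v0;
        /* Six zero-extended 64-bit terms: the sum is below 6 * 2^64,
           so plain 128-bit adds never carry out. */
        u128 col = (u128)(limb_t)p0 + (limb_t)(p0_prev >> 64)
                 + (limb_t)p1_prev + p1_hi_prev2 + rp[i] + carry;
        rp[i] = (limb_t)col;
        carry = (limb_t)(col >> 64);
        p1_hi_prev2 = (limb_t)(p1_prev >> 64);
        p0_prev = p0;
        p1_prev = (u128)up[i] * v1;
    }

    /* Column n: leftover halves of the last products. */
    u128 col = (u128)(limb_t)(p0_prev >> 64) + (limb_t)p1_prev
             + p1_hi_prev2 + carry;
    rp[n] = (limb_t)col;
    carry = (limb_t)(col >> 64);

    /* Column n+1: the full result is below 2^(64*(n+2)), so this
       64-bit add cannot wrap. */
    return (limb_t)(p1_prev >> 64) + carry;
}
```

In this sketch the inter-column carry travels as an ordinary limb, which
is what keeps every 128-bit add free of carry-in and carry-out.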

Regards,
Marius