[PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13
Torbjörn Granlund
tg at gmplib.org
Wed Mar 10 12:09:37 UTC 2021
Marius Hillenbrand <mhillen at linux.ibm.com> writes:
z14 introduced "alignment hints" for vector loads, where 8-byte aligned
reads have more bandwidth (e.g., "vl %v<dst>,<addr>,3" # 3 for 8-byte
alignment, 4 for 16-byte alignment). vlerg does not take these hints.
Empirically, I observe a slight advantage for vlerg nonetheless (~5%).
How does z13 interpret these hints? Ignore them, I hope!
My 8x unrolled addmul_1 with extensive pipelining gets to within ~20% of
mlgr throughput. Though, my current implementation only applies for >=
18 limbs (8 lead-in, 8x-unrolled, 2 limbs wrap-up) -- not very useful
for addmul_1, besides making the case for going for addmul_2.
~20% is very good, indeed.
I see that the price is deep static instruction scheduling. That's
sometimes necessary, but it often adds significant O(1) overhead.
If we're really crazy, we could make what I sometimes refer to as
overlapped software pipelining. With that, I mean that the outer loop of
e.g. mul_basecase combines the inner loop's wind-down code for outer
loop iteration j with the inner loop's feed-in code of iteration j+1.
But code complexity will probably be lower with addmul_2 or some such,
as we now have an inner loop with more inherent parallelism.
I'm looking at the addmul_2/3/4 variants, exploring parameters.
Given that mpn_mul requires s1n >= s2n, mul_basecase will always call
any variant of addmul_k with n > k (if I read the code correctly). Is
that an assumption that addmul_k can make in general?
Yes, it is. Or more correctly n >= k.
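
To make the n >= k point concrete, here is a rough portable C sketch of how
a basecase multiply might drive an addmul_2. It is not GMP's code: the
ref_* names, the limb_t/dlimb_t types, and the exact return conventions are
assumptions made for illustration, and unsigned __int128 is a GCC/Clang
extension. The assert in ref_addmul_2 states the precondition an unrolled
assembly version could rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* double limb, compiler extension */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + rp[i] + cy;  /* cannot overflow */
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}

/* Assumed convention: reads rp[0..n-1], writes rp[0..n], returns limb n+1
   of rp + up * {vp,2}.  Built from two addmul_1 passes for clarity only. */
static limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    assert(n >= 2);                 /* the n >= k precondition, k = 2 */
    rp[n] = ref_addmul_1(rp, up, n, vp[0]);
    return ref_addmul_1(rp + 1, up, n, vp[1]);
}

/* {rp, un+vn} = {up, un} * {vp, vn}; rp must not overlap the sources. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    assert(un >= vn && vn >= 1);
    rp[un] = ref_mul_1(rp, up, un, vp[0]);
    rp += 1; vp += 1; vn -= 1;
    while (vn >= 2) {
        /* Reached only when the original vn >= 3, hence un >= 3 > 2. */
        rp[un + 1] = ref_addmul_2(rp, up, un, vp);
        rp += 2; vp += 2; vn -= 2;
    }
    if (vn >= 1)
        rp[un] = ref_addmul_1(rp, up, un, vp[0]);
}
```

In this sketch ref_addmul_2 is always called with n = un, and un >= vn holds
on entry, so the loop never sees n < 2.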
Note a trickiness of mul_k which has bitten me before: It is allowed to
have the same source and destination, i.e., mul_2 (ap, ap, n, bp). My
mul_2 code therefore preloads from ap in a manner which my addmul_2 does
not. Of course, any slight software pipelining tends to take care of
this problem automatically.
And tests/devel/try knows how to trip code which gets this wrong.
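
The in-place case can be illustrated with a small C sketch of a mul_2 that
loads each source limb before storing the result limb at the same index, so
that rp == up is safe. This is only a reference model under assumed
conventions (the name ref_mul_2, the limb_t/dlimb_t types, and the return
value layout are hypothetical, not GMP's implementation); unsigned __int128
is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* double limb, compiler extension */

/* Assumed convention: writes rp[0..n] with the low n+1 limbs of
   {up,n} * {vp,2} and returns the most significant limb.
   rp == up is allowed, provided the buffer holds n+1 limbs. */
static limb_t ref_mul_2(limb_t *rp, const limb_t *up, size_t n,
                        const limb_t *vp)
{
    assert(n >= 1);
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t lo = 0, hi = 0;          /* pending sums for limbs i and i+1 */

    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];           /* load BEFORE the store to rp[i] */
        dlimb_t p0 = (dlimb_t)u * v0 + lo;
        dlimb_t p1 = (dlimb_t)u * v1 + hi + (limb_t)(p0 >> 64);
        rp[i] = (limb_t)p0;         /* up[i] is no longer needed */
        lo = (limb_t)p1;
        hi = (limb_t)(p1 >> 64);
    }
    rp[n] = lo;
    return hi;
}
```

A two-pass formulation (a mul_1 pass followed by an addmul_1 pass over the
same source) would read source limbs that the first pass had already
overwritten when rp == up, which is the failure mode described above.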
I would suggest that you concentrate on an addmul_2 to see how close you
can get to mlgr's throughput without making its software pipeline overly
deep. Going to addmul_k for larger k tends to diminish the returns, and
furthermore requires mul_1, mul_2, ..., mul_(k-1) in order to create the
*_basecase functions.
I made my addmul_2 actually work for all limb counts. (I think it even
unnecessarily handles n = 1.) Code attached.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: z14-addmul_2.asm
Type: application/octet-stream
Size: 3274 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20210310/ed74aa74/attachment-0001.obj>
-------------- next part --------------
--
Torbjörn
Please encrypt, key id 0xC8601622