[PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

Wed Mar 10 12:09:37 UTC 2021

Marius Hillenbrand <mhillen at linux.ibm.com> writes:

  z14 introduced "alignment hints" for vector loads, where 8-byte aligned
  reads have more bandwidth (e.g., "vl %v<dst>,<addr>,3" # 3 for 8-byte
  alignment, 4 for 16-byte alignment). vlerg does not take these hints.
  Empirically, I observe a slight advantage for vlerg nonetheless (~5%).

How does z13 interpret these hints?  Ignore them, I hope!

  My 8x unrolled addmul_1 with extensive pipelining gets to within ~20% of
  mlgr throughput. Though, my current implementation only applies for >=
  18 limbs (8 lead-in, 8x-unrolled, 2 limbs wrap-up) -- not very useful
  for addmul_1, besides making the case for going for addmul_2.

~20% is very good, indeed.

I see that the price is deep static instruction scheduling.  That's
sometimes necessary, but it often adds significant O(1) overhead.

If we're really crazy, we could make what I sometimes refer to as
overlapped software pipelining.  With that, I mean that the outer loop of
e.g. mul_basecase combines the inner loop's wind-down code for outer
loop iteration j with the inner loop's feed-in code of iteration j+1.

But code complexity will probably be lower with addmul_2 or some such,
as we now have an inner loop with more inherent parallelism.

  I'm looking at the addmul_2/3/4 variants, exploring parameters.

  Given that mpn_mul requires s1n >= s2n, mul_basecase will always call
  any variant of addmul_k with n > k (if I read the code correctly). Is
  that an assumption that addmul_k can make in general?

Yes, it is.  Or more correctly n >= k.
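
To make that contract concrete, here is a minimal C sketch (not GMP
code).  It assumes 64-bit limbs and the GCC/Clang unsigned __int128
extension.  The names limb_t, sketch_addmul_1 and sketch_addmul_2 are
made up for illustration, and the output convention (rp[n] is written,
the top limb is returned) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumption: 64-bit limbs */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n,
                              limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* (B-1)^2 + 2(B-1) = B^2 - 1, so this never overflows 128 bits */
        dlimb_t t = (dlimb_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}

/* Adds {up,n} * {vp,2} into rp: rp[0..n-1] are updated, rp[n] is
   overwritten, and the most significant limb is returned.
   The caller guarantees n >= 2 (that is, n >= k with k = 2). */
static limb_t sketch_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                              const limb_t *vp)
{
    assert(n >= 2);
    rp[n] = sketch_addmul_1(rp, up, n, vp[0]);
    return sketch_addmul_1(rp + 1, up, n, vp[1]);
}
```

An assembly addmul_2 may rely on the same precondition to skip any
special handling of operands shorter than its multiplier.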

Note a trickiness of mul_k which has bitten me before: It is allowed to
have the same source and destination, i.e., mul_2 (ap, ap, n, bp).  My
mul_2 code therefore preloads from ap in a manner which my addmul_2 does
not.  Of course, any slight software pipelining tends to take care of
this problem automatically.
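
The overlap issue can be sketched in C (not the attached assembly).  The
sketch assumes 64-bit limbs and the GCC/Clang unsigned __int128
extension; limb_t and sketch_mul_2 are hypothetical names.  Each source
limb is loaded once, before the corresponding result limb is stored, so
the call stays correct when rp == up.  A naive mul_1 pass followed by an
addmul_1 pass would instead re-read up after it had been overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumption: 64-bit limbs */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension */

/* {rp,n+1} plus the returned top limb = {up,n} * {vp,2}.
   Safe for rp == up because up[i] is read before rp[i] is written
   and is never read again afterwards. */
static limb_t sketch_mul_2(limb_t *rp, const limb_t *up, size_t n,
                           const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0, c1 = 0;          /* pending two-limb carry */

    assert(n >= 2);
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];                        /* load first */
        dlimb_t s = (dlimb_t)u * v0 + c0;        /* column i */
        dlimb_t t = (dlimb_t)u * v1 + (limb_t)(s >> 64) + c1;
        rp[i] = (limb_t)s;                       /* store afterwards */
        c0 = (limb_t)t;
        c1 = (limb_t)(t >> 64);
    }
    rp[n] = c0;
    return c1;
}
```

For an in-place call such as sketch_mul_2(ap, ap, n, bp), the buffer at
ap needs room for n+1 limbs.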

And tests/devel/try knows how to trip code which gets this wrong.

I would suggest that you concentrate on an addmul_2 to see how close you
can get to mlgr's throughput without making its software pipeline overly
deep.  Going to addmul_k for larger k tends to diminish the returns, and
furthermore requires mul_1, mul_2, ..., mul_(k-1) in order to create the
*_basecase functions.
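
A rough C sketch of how a mul_basecase could be assembled for k = 2
follows.  It is an illustration only, assuming 64-bit limbs and the
GCC/Clang unsigned __int128 extension; all sk_* names are hypothetical.
The first step uses mul_1 or mul_2 depending on the parity of vn, and
the remaining limbs of vp are consumed two at a time by addmul_2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumption: 64-bit limbs */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension */

/* rp[i] = up[i]*v (+ rp[i] if add) for i < n; returns the carry limb. */
static limb_t sk_row(limb_t *rp, const limb_t *up, size_t n, limb_t v,
                     int add)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + cy + (add ? rp[i] : 0);
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}

static limb_t sk_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    return sk_row(rp, up, n, v, 0);
}

/* Out-of-place only; writes rp[0..n] and returns the top limb. */
static limb_t sk_mul_2(limb_t *rp, const limb_t *up, size_t n,
                       const limb_t *vp)
{
    rp[n] = sk_row(rp, up, n, vp[0], 0);
    return sk_row(rp + 1, up, n, vp[1], 1);
}

/* Updates rp[0..n-1], writes rp[n], returns the top limb. */
static limb_t sk_addmul_2(limb_t *rp, const limb_t *up, size_t n,
                          const limb_t *vp)
{
    rp[n] = sk_row(rp, up, n, vp[0], 1);
    return sk_row(rp + 1, up, n, vp[1], 1);
}

/* {rp,un+vn} = {up,un} * {vp,vn}; needs un >= vn >= 1 and rp must not
   overlap either input. */
static void sk_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                            const limb_t *vp, size_t vn)
{
    size_t j;

    assert(un >= vn && vn >= 1);
    if (vn & 1) {                    /* odd vn: start with mul_1 */
        rp[un] = sk_mul_1(rp, up, un, vp[0]);
        j = 1;
    } else {                         /* even vn: start with mul_2 */
        rp[un + 1] = sk_mul_2(rp, up, un, vp);
        j = 2;
    }
    for (; j < vn; j += 2)           /* remaining rows, two at a time */
        rp[un + j + 1] = sk_addmul_2(rp + j, up, un, vp + j);
}
```

With a larger k, the first step has more possible remainders of vn to
cover, which is where the additional mul_* variants come in.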

I made my addmul_2 actually work for all limb counts.  (I think it even
unnecessarily handles n = 1.)  Code attached.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: z14-addmul_2.asm
Type: application/octet-stream
Size: 3274 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20210310/ed74aa74/attachment-0001.obj>
-------------- next part --------------

-- 
Torbjörn
Please encrypt, key id 0xC8601622