[PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13

Marius Hillenbrand mhillen at linux.ibm.com
Wed Mar 10 11:23:59 UTC 2021


Hi,

On 3/9/21 11:06 PM, Torbjörn Granlund wrote:
> Marius Hillenbrand <mhillen at linux.ibm.com> writes:
[...]
>   That absolutely makes sense. When I wrote my patches initially, it was
>   not yet clear that it is worthwhile to differentiate.
> 
> It is not clear, but it does not hurt to make config.guess be accurate,
> and then treat the CPUs the same way.  In my experience, people can get
> confused when GMP claims they have CPU foo-k when they actually have
> foo-(k+1).

OK, agreed. I had vlerg/vsterg vs vpdi in mind.

> I tried vlerg on the system here, and it works fine.  Very little timing
> difference though, but then again I didn't try very hard.
> 
> I am not aware of any timing differences between z13, z14, z15 for the
> L1 cache-hit cases.  Are there any?  And the only GMP-relevant ISA
> difference of which I am aware is the presence of vlerg in z15.

z14 introduced "alignment hints" for vector loads, where 8-byte aligned
reads have more bandwidth (e.g., "vl %v<dst>,<addr>,3" # 3 for 8-byte
alignment, 4 for 16-byte alignment). vlerg does not take these hints.
Empirically, I observe a slight advantage for vlerg nonetheless (~5%).

> How's it going with the various addmul_k variants?  My completely
> non-scheduled addmul_2 seems to run 37% slower than the mlgr throughput.
> That's not bad.  Some fiddling around with the schedule got me to just
> 25% slower.  That was with 2x unrolling.  I haven't tried anything
> sophisticated.

That is very good news.

> 
> How far is your best addmul_1 from mlgr's throughput?

My 8x unrolled addmul_1 with extensive pipelining gets to within ~20% of
mlgr throughput. However, my current implementation only applies for n >=
18 limbs (8 limbs of lead-in, the 8x-unrolled loop, 2 limbs of wrap-up)
-- not very useful for addmul_1 itself, but it strengthens the case for
going for addmul_2.
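For reference, the limb recurrence that any addmul_1 must implement can
be written as a plain C model (a hypothetical reference version, not the
pipelined assembly -- the names ref_addmul_1 and mp_limb_t are mine; the
__int128 multiply corresponds to what mlgr computes):

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference semantics: {rp,n} += {up,n} * v, returns the carry-out limb.
   This is the contract the unrolled/pipelined assembly must satisfy. */
mp_limb_t ref_addmul_1(mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
{
    mp_limb_t carry = 0;
    for (long i = 0; i < n; i++) {
        /* 64x64 -> 128 multiply, as mlgr does */
        unsigned __int128 p = (unsigned __int128) up[i] * v;
        /* add low product limb plus incoming carry to rp[i] */
        unsigned __int128 s = (unsigned __int128) rp[i] + (mp_limb_t) p + carry;
        rp[i] = (mp_limb_t) s;
        /* high product limb plus add carry; cannot overflow a limb */
        carry = (mp_limb_t) (p >> 64) + (mp_limb_t) (s >> 64);
    }
    return carry;
}
```

The two carry sources never overflow: the high product limb is at most
2^64 - 2, and the addition carry is at most 1.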

The loop is unrolled for 8 multiplications (4 "limb pairs" of 128 bits
each) and looks like this (simplified):

for (...) {
   LOAD(limb pair 1);
   MULT(0);
   SECOND_ADD(1);
   FIRST_ADD(2);
   WRITEBACK(0);   // from previous iteration or lead-in
   VLVGP(0);

   LOAD(2);
   MULT(1);
   SECOND_ADD(2);
   FIRST_ADD(3);
   WRITEBACK(1);
   VLVGP(1);

   LOAD(3);
   MULT(2);
   WRITEBACK(2);
   SECOND_ADD(3);
   FIRST_ADD(0);   // from MULT and VLVGP at top
   /* ... and so on ... */
}

> 
> I believe it to be possible to get pretty close to mlgr's throughput, if
> not by any other means by going to addmul_k for k > 2.  I think 8-way
> addmul_1 makes little sense, but I think 2-way or 4-way addmul_2, or
> 2-way addmul_3 or 2-way addmul_4 does make sense if they run close to
> mlgr's throughput.

I'm looking at the addmul_2/3/4 variants, exploring parameters.

Given that mpn_mul requires s1n >= s2n, mul_basecase will always call
any variant of addmul_k with n > k (if I read the code correctly). Is
that an assumption that addmul_k can make in general?
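To make the calling pattern concrete, here is a toy C model of
mul_basecase built on an addmul_1-style kernel (hypothetical and
simplified, not GMP's code; 32-bit limbs just to keep the carries
simple). With un >= vn, every inner call receives n = un limbs, so any
addmul_k variant would get n >= vn >= k:

```c
#include <stdint.h>

typedef uint32_t limb_t;

/* Toy kernel: {rp,n} += {up,n} * v, returns carry-out limb. */
limb_t toy_addmul_1(limb_t *rp, const limb_t *up, int n, limb_t v)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = (uint64_t) up[i] * v + rp[i] + carry;
        rp[i] = (limb_t) s;
        carry = s >> 32;
    }
    return (limb_t) carry;
}

/* rp must hold un + vn limbs; requires un >= vn, as mpn_mul does. */
void toy_mul_basecase(limb_t *rp, const limb_t *up, int un,
                      const limb_t *vp, int vn)
{
    for (int i = 0; i < un + vn; i++)
        rp[i] = 0;
    /* every kernel call gets n = un, never fewer limbs */
    for (int j = 0; j < vn; j++)
        rp[un + j] = toy_addmul_1(rp + j, up, un, vp[j]);
}
```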

Marius
-- 
Marius Hillenbrand
Linux on Z development
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Gregor Pillen / Managing Director:
Dirk Wittkopp
Registered office: Böblingen / Commercial register: Amtsgericht
Stuttgart, HRB 243294
