[PATCH 2/4] config.guess, configure.ac: Add detection of IBM z13
Marius Hillenbrand
mhillen at linux.ibm.com
Wed Mar 10 11:23:59 UTC 2021
Hi,
On 3/9/21 11:06 PM, Torbjörn Granlund wrote:
> Marius Hillenbrand <mhillen at linux.ibm.com> writes:
[...]
> That absolutely makes sense. When I wrote my patches initially, it was
> not yet clear that it would be worthwhile to differentiate.
>
> It is not clear, but it does not hurt to make config.guess accurate,
> and then treat the CPUs the same way. In my experience, people can get
> confused when GMP claims they have CPU foo-k when they actually have
> foo-(k+1).
OK, agreed. I had vlerg/vsterg vs vpdi in mind.
> I tried vlerg on the system here, and it works fine. Very little timing
> difference though, but then again I didn't try very hard.
>
> I am not aware of any timing differences between z13, z14, z15 for the
> L1 cache-hit cases. Are there any? And the only GMP-relevant ISA
> difference of which I am aware is the presence of vlerg in z15.
z14 introduced "alignment hints" for vector loads, where 8-byte-aligned
reads get more bandwidth (e.g., "vl %v<dst>,<addr>,3", where 3 denotes
8-byte alignment and 4 denotes 16-byte alignment). vlerg does not take
these hints. Empirically, I nonetheless observe a slight (~5%) advantage
for vlerg.
> How's it going with the various addmul_k variants? My completely
> non-scheduled addmul_2 seems to run 37% slower than the mlgr throughput.
> That's not bad. Some fiddling around with the schedule got me to just
> 25% slower. That was with 2x unrolling. I haven't tried anything
> sophisticated.
That is very good news.
>
> How far is your best addmul_1 from mlgr's throughput?
My 8x-unrolled addmul_1 with extensive pipelining gets to within ~20% of
mlgr throughput. However, my current implementation only applies to
operands of >= 18 limbs (8 limbs of lead-in, one 8x-unrolled iteration,
2 limbs of wrap-up) -- not very useful for addmul_1 itself, beyond
making the case for going to addmul_2.
The loop is unrolled across 8 multiplications (4 "limb pairs" of 128
bits each) and looks like this (simplified; the number after each stage
is the limb-pair slot it operates on):

for (...) {
    LOAD(1);          /* fetch limb pair 1 */
    MULT(0);
    SECOND_ADD(1);
    FIRST_ADD(2);
    WRITEBACK(0);     /* from previous iteration or lead-in */
    VLVGP(0);

    LOAD(2);
    MULT(1);
    SECOND_ADD(2);
    FIRST_ADD(3);
    WRITEBACK(1);
    VLVGP(1);

    LOAD(3);
    MULT(2);
    WRITEBACK(2);
    SECOND_ADD(3);
    FIRST_ADD(0);     /* from the MULT and VLVGP at the top */
    /* ... and so on ... */
}
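
To spell out what the stages compute: MULT is mlgr (64x64 -> 128 bits
into a GPR pair) and VLVGP moves that pair into a vector register; how
the two ADD stages split the carry chain is specific to my code, so take
the following scalar C model of one limb-pair step as an illustration
only (the function name and the exact carry splitting are mine):

#include <stdint.h>

typedef unsigned __int128 u128;

/* One limb-pair step of addmul_1: {rp,2} += {up,2} * v + cy,
   returning the carry limb for the next pair. */
static uint64_t
limb_pair_step (uint64_t *rp, const uint64_t *up, uint64_t v, uint64_t cy)
{
  u128 p0 = (u128) up[0] * v;      /* MULT + VLVGP for up[0] */
  u128 p1 = (u128) up[1] * v;      /* MULT + VLVGP for up[1] */
  u128 s0 = (u128) rp[0] + (uint64_t) p0 + cy;   /* FIRST_ADD */
  u128 s1 = (u128) rp[1] + (uint64_t) p1         /* SECOND_ADD */
            + (uint64_t) (p0 >> 64) + (uint64_t) (s0 >> 64);
  rp[0] = (uint64_t) s0;           /* WRITEBACK, low limb */
  rp[1] = (uint64_t) s1;           /* WRITEBACK, high limb */
  return (uint64_t) (p1 >> 64) + (uint64_t) (s1 >> 64);
}

The pipelined loop above performs, in essence, this arithmetic,
interleaved across limb pairs so that the mlgr latency is hidden.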
>
> I believe it to be possible to get pretty close to mlgr's throughput, if
> not by other means, then by going to addmul_k for k > 2. I think 8-way
> addmul_1 makes little sense, but I think 2-way or 4-way addmul_2, or
> 2-way addmul_3 or 2-way addmul_4 does make sense if they run close to
> mlgr's throughput.
I'm looking at the addmul_2/3/4 variants, exploring parameters.
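
For reference while exploring these, here is the addmul_2 contract as I
understand it from gmp-impl.h, written as a portable sketch on top of
mpn_addmul_1 (the function name is mine; the native versions return the
most significant limb the same way):

#include <gmp.h>

/* {rp, n+1} += {up, n} * {vp, 2}; returns the limb for position n+1.
   Assumes n >= 1. */
mp_limb_t
addmul_2_ref (mp_ptr rp, mp_srcptr up, mp_size_t n, mp_srcptr vp)
{
  mp_limb_t c0, c1;

  c0 = mpn_addmul_1 (rp, up, n, vp[0]);      /* row 0: positions 0..n */
  c1 = mpn_addmul_1 (rp + 1, up, n, vp[1]);  /* row 1: positions 1..n+1 */
  rp[n] += c0;                               /* fold row 0's carry-out */
  return c1 + (rp[n] < c0);  /* cannot overflow: the full result
                                fits in n + 2 limbs */
}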
Given that mpn_mul requires s1n >= s2n, mul_basecase will always call
any variant of addmul_k with n > k (if I read the code correctly). Is
that an assumption that addmul_k can make in general?
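
To make the calling pattern concrete, here is a simplified sketch
modeled on GMP's generic mul_basecase (using addmul_2_ref from above as
a stand-in for a native addmul_2):

static void
mul_basecase_sketch (mp_ptr rp, mp_srcptr up, mp_size_t un,
                     mp_srcptr vp, mp_size_t vn)
{
  /* requires un >= vn >= 1; rp has room for un + vn limbs */
  rp[un] = mpn_mul_1 (rp, up, un, vp[0]);        /* first row */
  rp += 1, vp += 1, vn -= 1;
  while (vn >= 2)                                /* addmul_2 rows */
    {
      rp[un + 1] = addmul_2_ref (rp, up, un, vp);
      rp += 2, vp += 2, vn -= 2;
    }
  if (vn >= 1)                                   /* odd final row */
    rp[un] = mpn_addmul_1 (rp, up, un, vp[0]);
}

Here every addmul_2 call gets n = un, and the loop only runs when the
original vn >= 3, so n >= vn > 2; the analogous argument would give
n > k for an addmul_k.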
Marius
--
Marius Hillenbrand
Linux on Z development
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Gregor Pillen / Management: Dirk Wittkopp
Registered office: Böblingen / Commercial register: Amtsgericht
Stuttgart, HRB 243294