Please update addaddmul_1msb0.asm to support ABI in mingw64
Torbjörn Granlund
tg at gmplib.org
Fri Oct 8 21:21:21 UTC 2021
I created an "ununrolled" version, and a 4x unrolled version. I then
compared these with some other variants. Here are the results:
        mul_1   addmul_1   addaddmul   result   best variant
zen1    2       2          2           =        all equal (saturated mul)
zen2    1.7     2.1        2.8         +        4x unrolled
zen3    1.5     1.5        3.9         -        1x/2x/4x unrolled
bwl     1.5     1.7        3.1         +        1x
skl     1.5     1.7        3.1         +        1x
rkl     1.1     1.6        2.7         =        1x
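(For context, the addaddmul column measures mpn_addaddmul_1msb0, i.e.
{rp,n} = {ap,n}*u + {bp,n}*v where both multipliers have a clear top bit.
Here is a hedged, portable C sketch of the semantics -- not GMP's code,
and the carry-return convention is my assumption:

  #include <stdint.h>
  #include <stddef.h>

  typedef uint64_t mp_limb_t;

  /* Reference sketch: {rp,n} = {ap,n}*u + {bp,n}*v, returning the carry-out
     limb.  Because u,v < 2^63 ("msb0"), the two products plus an incoming
     carry limb cannot overflow 128 bits, which is the property the asm
     exploits.  */
  mp_limb_t
  ref_addaddmul_1msb0 (mp_limb_t *rp, const mp_limb_t *ap,
                       const mp_limb_t *bp, size_t n,
                       mp_limb_t u, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    for (size_t i = 0; i < n; i++)
      {
        unsigned __int128 t = (unsigned __int128) ap[i] * u
                            + (unsigned __int128) bp[i] * v + cy;
        rp[i] = (mp_limb_t) t;
        cy = (mp_limb_t) (t >> 64);
      }
    return cy;
  }
)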
The only CPU which sees a great improvement is zen2. Both bwl and skl see
only a very minor improvement. I don't see your 20% figure for bwl.
The results on zen3 are really poor. I believe this CPU has quite some cost
for indexed memrefs. I think that's also true for bwl and perhaps skl, even
if the new code runs OK there.
We might want to produce variants which use plainer memory ops, i.e. code
which updates the base registers with lea instead of using indexed
addressing. That will require unrolling to at least 4x.
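To illustrate the shape I mean, here is a hedged C analogue of such a
non-indexed, 4x unrolled variant (not actual asm; remainder handling and
the helper name are mine): all loads and stores use small constant offsets
from the base pointers, and the pointers are bumped once per 4-limb round,
which in the asm would be lea updates instead of (up,n,8)-style indexed
memrefs.

  #include <stdint.h>
  #include <stddef.h>

  typedef uint64_t mp_limb_t;

  mp_limb_t
  ref_addaddmul_1msb0_4x (mp_limb_t *rp, const mp_limb_t *ap,
                          const mp_limb_t *bp, size_t n,
                          mp_limb_t u, mp_limb_t v)
  {
    mp_limb_t cy = 0;
    unsigned __int128 t;

    while (n >= 4)                      /* n % 4 != 0 handling omitted */
      {
        /* plain memrefs: constant offsets 0..3 from the base pointers */
        t = (unsigned __int128) ap[0] * u + (unsigned __int128) bp[0] * v + cy;
        rp[0] = (mp_limb_t) t;  cy = (mp_limb_t) (t >> 64);
        t = (unsigned __int128) ap[1] * u + (unsigned __int128) bp[1] * v + cy;
        rp[1] = (mp_limb_t) t;  cy = (mp_limb_t) (t >> 64);
        t = (unsigned __int128) ap[2] * u + (unsigned __int128) bp[2] * v + cy;
        rp[2] = (mp_limb_t) t;  cy = (mp_limb_t) (t >> 64);
        t = (unsigned __int128) ap[3] * u + (unsigned __int128) bp[3] * v + cy;
        rp[3] = (mp_limb_t) t;  cy = (mp_limb_t) (t >> 64);

        rp += 4;  ap += 4;  bp += 4;    /* base register updates, lea in asm */
        n -= 4;
      }
    return cy;
  }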
(The present addmul_1 code, which is used for bwl, skl, rkl, and zen3, is 8x
unrolled without indexed memrefs. It was loopmixed for bwl, but runs
surprisingly well elsewhere.)
We really should move the present code to the k8 subdir. It is a major
slowdown everywhere I have tested except k8 and k10. (I have not tested bd1
through bd4, but they have k8 in their asm paths (which might be good or
bad).)
If we want to pursue this more, this is what I would suggest:
1. Move the present code to x86_64/k8
2. Create a non-indexed variant, perhaps 4x unrolled. I suspect this might
give some real speedups for bwl, skl, perhaps rkl, and likely zen3.
3. If 2 is successful, commit the code to the bwl subdir and create the
relevant grabbers for the appropriate zen CPUs. If 2 is not successful,
commit the "ununrolled" code to the bwl subdir and make zen1/zen2 use it
(but make sure zen3 does not use that variant!).
4. Consider loopmixing.
--
Torbjörn
Please encrypt, key id 0xC8601622