Please update addaddmul_1msb0.asm to support ABI in mingw64

Fri Oct 8 21:21:21 UTC 2021

I created an "ununrolled" version, and a 4x unrolled version.  I then
compared these with some other variants.  Here are the results:

       mul_1  addmul_1  addaddmul    result     best variant
zen1    2       2         2             =       all equal (saturated mul)
zen2    1.7     2.1       2.8           +       4x unrolled
zen3    1.5     1.5       3.9           -       1x/2x/4x unrolled
bwl     1.5     1.7       3.1           +       1x
skl     1.5     1.7       3.1           +       1x
rkl     1.1     1.6       2.7           =       1x

The only CPU which sees a great improvement is zen2.  Both bwl and skl see
very minor improvement.  I don't see your 20% figure for bwl.

The results on zen3 are really poor.  I believe this CPU has quite some cost
for indexed memrefs.  I think that's also true for bwl and perhaps skl, even
if the new code runs OK there.

We might want to produce variants which use plainer memory ops, code which
updates baseregs with lea.  That will require unrolling to at least 4x.
(The present addmul_1 code which is used for bwl, skl, rkl, and zen3 is 8x
unrolled without indexed memrefs.  It was loopmixed for bwl, but runs
surprisingly well elsewhere.)
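
To make the loop structure concrete, here is a rough C sketch of the two
shapes discussed above: a 1x loop addressing memory through an index, and a
4x unrolled loop using constant offsets from base pointers that are bumped
once per iteration (the C-level analogue of updating baseregs with lea).
This is only an illustration, not the asm under discussion.  The names
limb_t, step, addaddmul_ref and addaddmul_4x are made up for this sketch, it
assumes 64-bit limbs and the GCC/Clang unsigned __int128 extension, and a C
compiler is of course free to choose its own addressing modes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical 64-bit limb type */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension */

/* One limb step: store the low limb of a*u + b*v + cy, return the high limb.
   With u, v < 2^63 the whole sum fits in 128 bits. */
static inline limb_t
step (limb_t *rp, limb_t a, limb_t b, limb_t u, limb_t v, limb_t cy)
{
  dlimb_t t = (dlimb_t) a * u + (dlimb_t) b * v + cy;
  *rp = (limb_t) t;
  return (limb_t) (t >> 64);
}

/* 1x loop, memory addressed as base + index. */
limb_t
addaddmul_ref (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n,
               limb_t u, limb_t v)
{
  assert ((u >> 63) == 0 && (v >> 63) == 0);  /* the "msb0" precondition */
  limb_t cy = 0;
  for (size_t i = 0; i < n; i++)
    cy = step (&rp[i], ap[i], bp[i], u, v, cy);
  return cy;
}

/* 4x unrolled loop, constant offsets from base pointers only. */
limb_t
addaddmul_4x (limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n,
              limb_t u, limb_t v)
{
  assert ((u >> 63) == 0 && (v >> 63) == 0);
  limb_t cy = 0;
  for (; n % 4 != 0; n--)           /* peel off the n mod 4 leading limbs */
    cy = step (rp++, *ap++, *bp++, u, v, cy);
  for (; n != 0; n -= 4)
    {
      cy = step (rp + 0, ap[0], bp[0], u, v, cy);
      cy = step (rp + 1, ap[1], bp[1], u, v, cy);
      cy = step (rp + 2, ap[2], bp[2], u, v, cy);
      cy = step (rp + 3, ap[3], bp[3], u, v, cy);
      ap += 4; bp += 4; rp += 4;    /* base pointer updates, one per round */
    }
  return cy;
}
```

Both functions compute the same result; only the addressing pattern and the
amount of unrolling differ.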

We really should move the present code to the k8 subdir.  It is a major
slowdown everywhere I have tested except k8 and k10.  (I have not tested bd1
through bd4, but they have k8 in their asm paths (which might be good or
bad).)

If we want to pursue this more, this is what I would suggest:

1. Move the present code to x86_64/k8

2. Create a non-indexed variant, perhaps 4x unrolled.  I suspect this might
   give some real speedups for bwl, skl, perhaps rkl, and likely zen3.

3. If 2 is successful, commit the code to the bwl subdir and create relevant
   grabbers for the appropriate zen CPUs.  If 2 is not successful, commit
   the "ununrolled" code to the bwl subdir and make zen1/zen2 use it (but
   make sure zen3 does not use that variant!).

4. Consider loopmixing.

-- 
Torbjörn
Please encrypt, key id 0xC8601622