Please update addaddmul_1msb0.asm to support ABI in mingw64

Torbjörn Granlund tg at
Thu Oct 7 08:00:07 UTC 2021

nisse at (Niels Möller) writes:

  Gave it a run on my closest x86_64 (Intel Broadwell, no mulx), and
  the numbers for mpn_addaddmul_1msb0 are not impressive.  Also, it
  appears mpn_addmul_2 is significantly slower than two addmul_1 calls.

I believe addmul_2 is inhibited for that CPU.  It might still appear in
the compiled library, though.  :-(

  79            #1.5617        1.8006        4.3277        4.6949
  86            #1.5702        1.7883        4.3290        4.7031
  94            #1.5441        1.7743        4.3321        4.7018

  So there's definitely some room for improvement.

The odd instruction order of the present loop suggests it was optimised
for K8.  In fact, it runs almost optimally there.

(32 loop instructions; the 6 muls each occupy a double decode slot, so
38 slots in all.  With 3-way issue and 6-way unrolling, that gives
(32+6)/3/6 = 2.111... cycles/limb, very close to the measured 2.167.)

Beating mul_1 + addmul_1 elsewhere without loopmixing will probably be
hard.  We should probably move the present code into the k8 subdir.

Please encrypt, key id 0xC8601622
