Please update addaddmul_1msb0.asm to support ABI in mingw64

Wed Oct 6 09:50:19 UTC 2021

I haven't followed this discussion very closely, and did not see if you
have considered the following.

OK, so the code is 3-ways unrolled.  That's always a bit inconvenient
and tends to cause some code bloat.

I am pretty sure we have that in at least some other place, but there we
still do all the work in one loop, switching into the appropriate place
from the feed-in code.
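
As a rough C sketch of that single-loop structure (purely illustrative,
not GMP code: the name add_limbs_nocarry and the carry-free limb addition
are placeholders standing in for the real loop body), the feed-in code
can pick its entry point into a 3-way unrolled loop with a switch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: rp[i] = ap[i] + bp[i] for 0 <= i < n, ignoring
   carries.  One 3-way unrolled loop; the switch plays the role of the
   feed-in code and jumps to the right place inside the loop body.  */
static void
add_limbs_nocarry (uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                   size_t n)
{
  size_t i = 0;
  size_t iters;

  if (n == 0)
    return;

  /* Number of passes through the loop, counting the partial first one.  */
  iters = (n + 2) / 3;

  switch (n % 3)
    {
    case 0:
      do
        {
          rp[i] = ap[i] + bp[i]; i++;
    case 2:
          rp[i] = ap[i] + bp[i]; i++;
    case 1:
          rp[i] = ap[i] + bp[i]; i++;
        }
      while (--iters > 0);
    }
}
```

In assembly the switch would be a computed or conditional jump into the
loop, so only one copy of the loop body is needed.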

It is not expensive to compute something like

  (3^(-1) mod 2^32)*n mod 2^32 / 2^30

in the feed-in code.  (3^(-1) mod 2^32) = 0xaaaaaaab, so we can do the
above with two instructions (imul and shr).  The latency of imul+shr is
<= 4 on modern architectures.

Since addaddmul_1msb0 is strictly internal, and since it presumably is
used for very limited values of n, I assume 32-bit arithmetic on n is
sufficient.

(Note that the tricky mod computation above "maps" the remainder 1 to 2
and the remainder 2 to 1.)
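
A minimal C sketch of that computation, for checking the mapping (the
helper name entry_index is hypothetical, and the bound n < 2^30 is my
assumption about what "very limited values of n" needs to mean for the
top two bits to be reliable):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP.  Returns 0, 2 or 1 when
   n mod 3 is 0, 1 or 2 respectively.  */
static unsigned
entry_index (uint32_t n)
{
  /* Assumed bound: beyond this, n/3 can spill into the top two bits.  */
  assert (n < (UINT32_C (1) << 30));

  /* 0xaaaaaaab is the inverse of 3 modulo 2^32.  The cast keeps the
     product reduced mod 2^32; the shift extracts the top two bits.  */
  return (uint32_t) (n * UINT32_C (0xaaaaaaab)) >> 30;
}
```

For n = 3k the product is just k, so the top bits are 0.  For n = 3k+1
it is k + 0xaaaaaaab (top bits 10), and for n = 3k+2 it is
k + 0x55555556 (top bits 01), which gives the swapped mapping.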

Other ideas:
* Use xor r,r instead of mov $0,r (keeping in mind that xor clobbers the
  carry flag).

* Use one more register for accumulation, with 4x unrolling.  That would
  save the 0xaaaaaaab magic mul.

* Provide a variant using mulx.

* Accumulate differently, say 4 consecutive limbs at a time, with carry
  being alive.  That will require more registers for sure.  By using
  adcx and adox, one may accumulate to the same registers in two chains
  semi-simultaneously.

* Use rbx instead of r12 to save a byte or two.

I suspect the present code is far from optimal on modern x86 CPUs which
can sustain 1 64x64->128 multiply per cycle.  I feel confident that we
could reach close to 1 c/l.

-- 
Torbjörn
Please encrypt, key id 0xC8601622