Please update addaddmul_1msb0.asm to support ABI in mingw64
tg at gmplib.org
Wed Oct 6 09:50:19 UTC 2021
I haven't followed this discussion very closely, and did not see if you
have considered the following.
OK, so the code is 3-ways unrolled. That's always a bit inconvenient
and tends to cause some code bloat.
I am pretty sure we have that in at least some other place, but still:
do all the work in one loop, and switch into the appropriate place from the
feed-in code.
It is not expensive to compute something like
((3^(-1) mod 2^32) * n mod 2^32) / 2^30
in the feed-in code. (3^(-1) mod 2^32) = 0xaaaaaaab, so we can do the
above with two instructions (imul and shr). The latency of imul+shr is
<= 4 on modern architectures.
Since addaddmul_1msb0 is strictly internal, and since it presumably is
used for very limited values of n, I assume 32-bit arithmetic on n is
safe.
(Note that the tricky mod computation above "maps" the remainder 1 to 2
and the remainder 2 to 1.)
* Use xor r,r instead of mov $0,r (considering that xor messes with the
  flags, it must be placed where no carry is live).
* Use one more register for accumulation, with 4x unrolling. That would
save the 0xaaaaaaab magic mul.
* Provide variant with mulx.
* Accumulate differently, say 4 consecutive limbs at a time, with carry
being alive. That will require more registers for sure. By using
  adcx and adox, one may accumulate into the same registers in two
  independent carry chains.
* Use rbx instead of r12 to save a byte or two.
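For reference, here is a portable C sketch of what addaddmul_1msb0 is assumed to compute: {rp,n} = {ap,n}*u + {bp,n}*v, with the most significant bit of u and v clear. This is a plain reference loop assuming a 64-bit compiler with __int128, not the GMP assembly, and the exact GMP prototype may differ:

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference loop for the assumed semantics of addaddmul_1msb0:
   {rp,n} = {ap,n}*u + {bp,n}*v, returning the high limb.  The msb0
   restriction (u, v < 2^63) guarantees that the sum of the two
   128-bit partial products plus the incoming carry limb still fits
   in 128 bits, so no extra carry handling is needed. */
mp_limb_t
ref_addaddmul_1msb0 (mp_limb_t *rp, const mp_limb_t *ap,
                     const mp_limb_t *bp, long n,
                     mp_limb_t u, mp_limb_t v)
{
  unsigned __int128 acc = 0;
  for (long i = 0; i < n; i++)
    {
      acc += (unsigned __int128) ap[i] * u;
      acc += (unsigned __int128) bp[i] * v;
      rp[i] = (mp_limb_t) acc;   /* store low limb */
      acc >>= 64;                /* keep carry for next limb */
    }
  return (mp_limb_t) acc;
}
```

A loop like this is also a convenient oracle for testing any unrolled assembly variant against random inputs.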
I suspect the present code is far from optimal on modern x86 CPUs, which
can sustain one 64x64->128 multiply per cycle. I feel confident that we
could reach close to 1 c/l.
Please encrypt, key id 0xC8601622