Please update addaddmul_1msb0.asm to support ABI in mingw64
Torbjörn Granlund
tg at gmplib.org
Wed Oct 6 09:50:19 UTC 2021
I haven't followed this discussion very closely, and did not see
whether you have considered the following.
OK, so the code is 3-way unrolled. That's always a bit inconvenient
and tends to cause some code bloat.
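For anyone joining the thread here: as I understand the internal
interface, addaddmul_1msb0 computes rp[] = ap[]*u + bp[]*v over n limbs
and returns the high limb, under the precondition that u and v both
have their most significant bit clear. A plain C model of that (the
function name and the exact return convention here are my sketch, not
the .asm file's contract):

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference sketch: rp[] = ap[]*u + bp[]*v, returning the high limb.
   Assumes n >= 1 and u, v < 2^63 (the "msb0" precondition).  */
mp_limb_t
ref_addaddmul_1msb0 (mp_limb_t *rp, const mp_limb_t *ap,
                     const mp_limb_t *bp, long n,
                     mp_limb_t u, mp_limb_t v)
{
  unsigned __int128 acc = 0;    /* high limb carried between iterations */
  for (long i = 0; i < n; i++)
    {
      acc += (unsigned __int128) ap[i] * u;
      acc += (unsigned __int128) bp[i] * v;
      rp[i] = (mp_limb_t) acc;
      acc >>= 64;
    }
  return (mp_limb_t) acc;
}

The msb0 precondition is what makes the inner step cheap: with
u, v < 2^63, the two products plus the incoming high limb always fit
in 128 bits, so no extra carry handling is needed.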
I am pretty sure we already do this in at least some other place: do
all the work in one loop, switching into the appropriate place from
the feed-in code.
It is not expensive to compute something like

  ((3^(-1) mod 2^32) * n mod 2^32) / 2^30

in the feed-in code. (3^(-1) mod 2^32) = 0xaaaaaaab, so we can do the
above with two instructions (imul and shr). The latency of imul+shr is
<= 4 on modern architectures.
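Concretely, a standalone check of that computation (a compiler should
lower the expression to one imul-by-immediate and one shr):

#include <stdint.h>
#include <stdio.h>

/* The feed-in index: (3^(-1) mod 2^32) * n mod 2^32, then / 2^30.  */
static unsigned
feedin_index (uint32_t n)
{
  return (n * 0xaaaaaaabu) >> 30;
}

int
main (void)
{
  /* remainder 0 -> 0, 1 -> 2, 2 -> 1 (note the swap)  */
  for (uint32_t n = 1; n <= 9; n++)
    printf ("n=%u: n mod 3 = %u, index = %u\n", n, n % 3,
            feedin_index (n));
  return 0;
}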
Since addaddmul_1msb0 is strictly internal, and since it presumably is
used only for very limited values of n, I assume 32-bit arithmetic on n
is sufficient.
(Note that the tricky mod computation above "maps" the remainder 1 to 2
and the remainder 2 to 1.)
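Putting the pieces together, here is a C model of the single
3x-unrolled loop with computed feed-in; the switch is only a stand-in
for the indirect jump, and the 1 <-> 2 swap noted above is exactly what
lets the mapped value be used directly as the entry case:

#include <stdint.h>

typedef uint64_t mp_limb_t;

/* C model of one 3x-unrolled loop with computed feed-in.  Assumes
   n >= 1 and the msb0 precondition on u and v.  */
mp_limb_t
addaddmul_1msb0_3x (mp_limb_t *rp, const mp_limb_t *ap,
                    const mp_limb_t *bp, long n,
                    mp_limb_t u, mp_limb_t v)
{
  unsigned __int128 acc = 0;
  long i = 0;
  /* n mod 3 == 0 -> case 0, 1 -> case 2, 2 -> case 1.  */
  unsigned idx = ((uint32_t) n * 0xaaaaaaabu) >> 30;

  switch (idx)
    {
      do
        {
    case 0:
          acc += (unsigned __int128) ap[i] * u
               + (unsigned __int128) bp[i] * v;
          rp[i] = (mp_limb_t) acc;  acc >>= 64;  i++;
    case 1:
          acc += (unsigned __int128) ap[i] * u
               + (unsigned __int128) bp[i] * v;
          rp[i] = (mp_limb_t) acc;  acc >>= 64;  i++;
    case 2:
          acc += (unsigned __int128) ap[i] * u
               + (unsigned __int128) bp[i] * v;
          rp[i] = (mp_limb_t) acc;  acc >>= 64;  i++;
        }
      while (i < n);
    }
  return (mp_limb_t) acc;
}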
Other ideas:
* Use xor r,r instead of mov $0,r (keeping in mind that xor clobbers
  the carry flag).
* Use one more register for accumulation, with 4x unrolling. That would
  save the 0xaaaaaaab magic mul, since the feed-in index becomes just
  n & 3.
* Provide a variant using mulx.
* Accumulate differently, say 4 consecutive limbs at a time, with carry
  being alive. That will require more registers for sure. By using
  adcx and adox, one may accumulate into the same registers in two
  chains semi-simultaneously (see the sketch after this list).
* Use rbx instead of r12 to save a byte or two.
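To illustrate the mulx and two-chain bullets, a C sketch using the
_mulx_u64 and _addcarry_u64 intrinsics (compile with -mbmi2; whether a
compiler actually lowers the two chains to adcx/adox is another
matter). The point is the structure: flag-free multiplies, two live
carry chains, one store per limb:

#include <stdint.h>
#include <immintrin.h>

typedef uint64_t mp_limb_t;

/* Sketch: one pass, two product streams, two live carry chains.
   mulx computes a 64x64->128 product without touching flags, so in
   asm both adc chains could stay alive across the multiplies.  */
mp_limb_t
addaddmul_mulx (mp_limb_t *rp, const mp_limb_t *ap, const mp_limb_t *bp,
                long n, mp_limb_t u, mp_limb_t v)
{
  unsigned char c1 = 0, c2 = 0;           /* one carry bit per chain */
  unsigned long long hi_a = 0, hi_b = 0;  /* high halves from limb i-1 */

  for (long i = 0; i < n; i++)
    {
      unsigned long long lo_a, lo_b, new_hi_a, new_hi_b, out;
      lo_a = _mulx_u64 (ap[i], u, &new_hi_a);
      lo_b = _mulx_u64 (bp[i], v, &new_hi_b);

      /* chain 1: the a*u stream; chain 2: the b*v stream */
      c1 = _addcarry_u64 (c1, lo_a, hi_a, &lo_a);
      c2 = _addcarry_u64 (c2, lo_b, hi_b, &lo_b);

      /* Combine the streams into the result limb.  The msb0 bound
         (u, v < 2^63) keeps new_hi_a below 2^63, so folding the
         combine carry into it cannot overflow.  */
      unsigned char c3 = _addcarry_u64 (0, lo_a, lo_b, &out);
      rp[i] = out;

      hi_a = new_hi_a + c3;
      hi_b = new_hi_b;
    }
  /* Drain the remaining high halves and carry bits into the return
     limb; again msb0 guarantees this fits in 64 bits.  */
  return hi_a + hi_b + c1 + c2;
}

In asm one would of course keep everything in registers and let
adcx/adox carry the two chains across a block; the C is only meant to
show that the bookkeeping works out.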
I suspect the present code is far from optimal on modern x86 CPUs,
which can sustain one 64x64->128 multiply per cycle. Each limb needs
two such multiplies, so the throughput bound is 2 c/l, and I feel
confident that we could get close to it.
--
Torbjörn
Please encrypt, key id 0xC8601622