Please update addaddmul_1msb0.asm to support ABI in mingw64

Niels Möller nisse at lysator.liu.se
Fri Oct 8 14:56:24 UTC 2021


Torbjörn Granlund <tg at gmplib.org> writes:

> Your version is faster than my versions (where I tested them).
>
> I made some minor changes to your code.

Nice!

Made a few additional tweaks, and tried to get it in the right place.
Attaching a patch that adds the file under coreibwl, as suggested, with
an include_mpn from zen. Took out the undefine hack (which I added
because my first attempt at using the mulx macros failed), and copied the
windows support stuff from the old implementation. Complete patch
attached, does it look good enough? It now gives approximately 30% speedup
compared to mul_1 + addmul_1 on my machine. I'd prefer to have a look
at addsubmul_1msb0 next, before trying to optimize this loop further.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: addaddmul_1msb0-mulx.patch
Type: text/x-diff
Size: 4565 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20211008/12fb99ee/attachment.bin>
-------------- next part --------------

> dnl  AMD64 mpn_addsubmul_1msb0, R = Au - Bv, u,v < 2^63.

This comment is obviously wrong ;-)

But that function could be implemented by adding two "not %rdx" in the
right places of the loop, plus a small adjustment just before and after
the loop.

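This works because, for n-limb operands A and B,

 Au - Bv = Au + (2^{64 n} - 1 - B) v - 2^{64 n} v + v

where 2^{64 n} - 1 - B is the limb-wise complement of B. So complement B
on the fly, set the initial carry limb to v, and subtract v from
the return value. (Same trick as in arm/v7a/cora15/submul_1).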
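
For illustration, a minimal Python sketch of the trick above, checked
against plain big-integer arithmetic. This is not the assembly loop; the
helper names (to_limbs, from_limbs, addsubmul_1msb0_sketch, check) are
made up for this example, and the convention that the function returns a
signed high limb is an assumption.

```python
import random

LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n 64-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def addsubmul_1msb0_sketch(a, b, u, v):
    """Compute A*u - B*v limb by limb using the complement trick.

    Returns (r, hi) with A*u - B*v == from_limbs(r) + hi * 2^(64 n).
    hi may be negative (assumed convention for this sketch).
    """
    assert len(a) == len(b)
    assert 0 <= u < (1 << 63) and 0 <= v < (1 << 63)
    carry = v                      # initial carry limb is v
    r = []
    for ai, bi in zip(a, b):
        # Multiply the complemented limb of B by v, as "not" would do.
        t = ai * u + (bi ^ MASK) * v + carry
        assert t < (1 << 128)      # fits in two limbs because u, v < 2^63
        r.append(t & MASK)
        carry = t >> LIMB_BITS
    return r, carry - v            # subtract v from the return value


def check(n=4, trials=1000, seed=0):
    """Compare the sketch against direct big-integer evaluation."""
    rng = random.Random(seed)
    for _ in range(trials):
        A = rng.getrandbits(LIMB_BITS * n)
        B = rng.getrandbits(LIMB_BITS * n)
        u = rng.getrandbits(63)
        v = rng.getrandbits(63)
        r, hi = addsubmul_1msb0_sketch(to_limbs(A, n), to_limbs(B, n), u, v)
        if from_limbs(r) + hi * (1 << (LIMB_BITS * n)) != A * u - B * v:
            return False
    return True
```

The per-limb sum stays below 2^128 only because u and v are below 2^63,
which is why the same loop structure as the add-add variant can be reused.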

Should definitely be worth a try, before trying some completely
different loop.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

