gcd_22
Torbjörn Granlund
tg at gmplib.org
Mon Aug 26 09:16:07 UTC 2019
"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:
If I'm not reading this timings wrongly, this means that with the current
code (disregarding the overhead, for those 64-bits limbs)
the bits in the limb 1 require 4 cycles each;
the bits in the limb 2 require 8 cycles each;
the bits in the limb 3 require 54 cycles each;
the bits in the limb 4 require 33 cycles each...
Timing is confusing, and I cannot tell if you're right. Perhaps Niels
can?!
For good measure, I write a gcd_33 in the style of gcd_22. It runs
about 20% slower than gcd_22 on AMD Ryzen, so really well! The code
runs on Intel Haswell and later and on AMD Excavator and later.
Attached below.
I agree with Niels that we should optimise hgcd2 and its div1 and div2
callees. They are shock full of unpredicable branches. But it might
also make sense to provide a set of gcd_kk for small k, as gcd is an
important operation which is also very slow compared to most other GMP
operations.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x64-zny-gcd_33.asm
Type: application/octet-stream
Size: 4321 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190826/8d968048/attachment.obj>
-------------- next part --------------
--
Torbj?rn
Please encrypt, key id 0xC8601622
More information about the gmp-devel
mailing list