gcd_22

Mon Aug 26 09:16:07 UTC 2019

"Marco Bodrato" <bodrato at mail.dm.unipi.it> writes:

  If I'm not reading this timings wrongly, this means that with the current
  code (disregarding the overhead, for those 64-bits limbs)
  the bits in the limb 1 require 4 cycles each;
  the bits in the limb 2 require 8 cycles each;
  the bits in the limb 3 require 54 cycles each;
  the bits in the limb 4 require 33 cycles each...

Timing is confusing, and I cannot tell if you're right.  Perhaps Niels
can?!

For good measure, I write a gcd_33 in the style of gcd_22.  It runs
about 20% slower than gcd_22 on AMD Ryzen, so really well!  The code
runs on Intel Haswell and later and on AMD Excavator and later.
Attached below.

I agree with Niels that we should optimise hgcd2 and its div1 and div2
callees.  They are shock full of unpredicable branches.  But it might
also make sense to provide a set of gcd_kk for small k, as gcd is an
important operation which is also very slow compared to most other GMP
operations.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: x64-zny-gcd_33.asm
Type: application/octet-stream
Size: 4321 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20190826/8d968048/attachment.obj>
-------------- next part --------------

-- 
Torbj?rn
Please encrypt, key id 0xC8601622