gcd_22
Marco Bodrato
bodrato at mail.dm.unipi.it
Sun Aug 25 14:33:18 UTC 2019
Ciao,
Il Dom, 25 Agosto 2019 9:34 am, Niels Möller ha scritto:
> tg at gmplib.org (Torbjörn Granlund) writes:
>> Should we have gcd_33, gcd_44, and gcd_55 also?
> I think it would be more beneficial to optimize hgcd2.
> E.g., 3-limb gcd should typically boil down to one hgcd2, a couple of
> 3:1 mul_1/addmul_1, and one call to gcd_22. It's not obvious that a
> gcd_33 using the binary algorithm will be faster.
On shell I get:
$ tune/speed -CD -s16-64 -t48 mpn_gcd_11 mpn_gcd_22
overhead 6.84 cycles, precision 10000 units of 2.86e-10 secs, CPU freq
3500.09 MHz
mpn_gcd_11 mpn_gcd_22
16 (#78.3223) (134.4004)
64 #4.0877 7.8639
$ tune/speed -CD -s1-4 mpn_gcd
overhead 6.83 cycles, precision 10000 units of 2.86e-10 secs, CPU freq
3500.09 MHz
mpn_gcd
1 (350.7188)
2 519.4961
3 3465.1028
4 2168.9167
If I'm not reading this timings wrongly, this means that with the current
code (disregarding the overhead, for those 64-bits limbs)
the bits in the limb 1 require 4 cycles each;
the bits in the limb 2 require 8 cycles each;
the bits in the limb 3 require 54 cycles each;
the bits in the limb 4 require 33 cycles each...
On the same machine, with ABI=32, gcd_22.c is used:
$ tune/speed -CD -s8-32 -t24 mpn_gcd_11 mpn_gcd_22
overhead 6.86 cycles, precision 10000 units of 2.86e-10 secs, CPU freq
3500.09 MHz
mpn_gcd_11 mpn_gcd_22
8 (#45.9492) (126.3672)
32 #4.5485 17.2290
$ tune/speed -CD -s1-4 mpn_gcd
overhead 6.85 cycles, precision 10000 units of 2.86e-10 secs, CPU freq
3500.09 MHz
mpn_gcd
1 (270.6172)
2 538.6328
3 1797.6500
4 1258.4594
the bits in the limb 1 require 4.5 cycles each;
the bits in the limb 2 require 17 cycles each;
the bits in the limb 3 require 56 cycles each;
the bits in the limb 4 require 39 cycles each...
It really seem that the _33 case need some work.
Ĝis,
m
--
http://bodrato.it/papers/
More information about the gmp-devel
mailing list