Small operands gcd improvements

Thu Aug 8 05:11:19 UTC 2019

tg at gmplib.org (Torbjörn Granlund) writes:

> I'm having problems with timing of the gcd_11 code.  Unfortunately, the
> nested macros of speed.h make things hard to read.  Could yo
> double-check that operands to gcd_11 are odd and full limbs?

I'm fairly sure they are odd; I've ran speed on a --disable-assembly
--enable-assert build. But they're full limbs only for -s 64. This is
what I see, after your asm changes:

$ ./tune/speed -p 100000 -c -s 1-64 mpn_gcd_1 mpn_gcd_11
overhead 5.01 cycles, precision 100000 units of 1.14e-09 secs, CPU freq
876.11 MHz
            mpn_gcd_1    mpn_gcd_11
1               12.05         #8.05
2               13.56         #8.69
3               16.23         #8.88
4               16.67         #9.37
5               20.04        #12.55
6               22.88        #16.06
7               28.21        #21.21
8               32.47        #27.34
9               34.53        #32.17
10              41.24        #38.47
11              43.65        #42.90
12             #49.50         50.26
13             #55.92         58.40
14             #57.88         61.12
15             #66.06         66.68
16             #68.26         71.43
17             #71.36         75.56
18             #77.95         81.72
19             #83.09         86.64
20             #84.53         89.62
21             #86.88         93.23
22             #97.32         98.35
23             101.75       #101.46
24            #107.63        108.32
25            #110.05        110.30
26             118.25       #114.15
27            #115.00        119.02
28            #118.93        123.26
29            #123.22        124.74
30            #127.43        131.10
31            #134.56        139.14
32             139.34       #139.21
33             142.63       #141.76
34            #145.42        147.85
35            #147.92        150.54
36            #151.31        154.07
37            #152.37        159.75
38            #155.48        162.92
39             168.28       #164.72
40            #163.92        169.57
41             177.38       #171.68
42            #183.22        185.21
43            #188.94        189.92
44            #195.21        196.79
45             198.06       #197.09
46            #200.67        201.34
47            #203.37        205.09
48             208.80       #208.16
49            #212.83        216.31
50            #215.69        217.80
51            #219.62        228.48
52             237.99       #228.64
53             222.52       #219.11
54            #219.05        222.15
55            #221.39        227.44
56            #225.41        230.09
57            #226.36        235.82
58            #231.66        242.04
59            #239.70        246.53
60            #239.07        246.82
61            #242.48        249.31
62            #244.55        254.03
63            #249.94        257.11
64            #254.20        262.47

I.e., differences of just a few cycles, but disappointingly not always
in favor of the new gcd_11. If we fix the size of one of the operands,
the difference due to initial reduction in gcd_1 is very visible:

$ ./tune/speed -p 100000 -c -s 1-64 mpn_gcd_1.10 mpn_gcd_11.10
overhead 5.01 cycles, precision 100000 units of 9.00e-10 secs, CPU freq
1111.62 MHz
         mpn_gcd_1.10 mpn_gcd_11.10
1               48.56        #31.68
2               47.65        #30.24
3               48.78        #33.06
4               49.02        #33.15
5               50.00        #31.07
6               48.74        #34.44
7               52.14        #35.24
8               52.51        #37.61
9               46.81        #39.10
10              56.40        #41.97
11              59.66        #42.12
12              62.68        #47.73
13              71.72        #49.63
14              84.10        #54.56
15             102.05        #57.25
16             117.86        #63.12
17             125.77        #58.76
18             120.16        #63.45
19             123.74        #66.41
20             145.42        #82.43
21             143.24        #83.48
22             147.30        #86.79
23             139.39        #81.41
24             126.56        #84.67
25             124.91        #86.97
26             126.57        #88.07
27             149.59       #102.50
28             146.80       #109.59
29             146.96       #103.42
30             140.75       #101.46
31             127.92       #103.95
32             126.41       #108.93
33             126.68       #112.32
34             142.09       #126.03
35             143.65       #130.04
36             142.99       #130.96
37             143.80       #131.02
38             127.91       #126.18
39            #127.76        129.48
40            #128.44        130.90
41             151.79       #142.34
42            #142.43        145.96
43            #142.62        146.64
44            #128.56        145.52
45            #129.30        150.46
46            #129.68        155.30
47            #129.07        159.03
48            #132.37        158.59
49            #126.59        158.04
50            #128.14        161.57
51            #129.74        164.20
52            #127.63        166.14
53            #130.12        170.52
54            #130.41        175.19
55            #133.91        173.54
56            #129.01        176.41
57            #127.91        178.71
58            #126.03        181.28
59            #129.15        185.43
60            #130.57        188.86
61            #130.14        190.29
62            #127.46        194.79
63            #129.22        196.25
64            #127.74        202.10

> Speaking of gcd_22.  We need to determine this function's interface.

I've been considering

void 
mpn_gcd_22 (mp_ptr rp, 
            mp_limb_t u1, mp_limb_t u0, 
            mp_limb_t v1, mp_limb_t v0);

with the requirement that u0 and v0 are odd. Do you prefer something
different? I'm thinking that it should not be a public function.

> The last loop will be 11.  We can simply inline a copy here as it is
> tiny.  (A tail call won't work as the functions will have different
> return types.)

Except if we have lots of tuned variants of gcd_11 and fewer gcd_22;
then gcd_22 ought to call the separately selected gcd_11.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.