New thresholds in table

Niels Möller nisse at lysator.liu.se
Thu Nov 17 20:32:53 CET 2011


Torbjorn Granlund <tg at gmplib.org> writes:

> When you write that the threshold is "bogus", what do you mean?

That the threshold is below the point where the dc algorithm starts to
win consistently.

> I am not familiar with the HGCD algorithms and code, but a scenario
> where an algorithm might be faster in disjoint ranges makes perfect
> sense.

That may be the case here. That's certainly unexpected to me.

For example, on bobcat tuneup gave a threshold of 31 limbs. I put this
threshold in gmp-mparam.h, and ran

  ./speed -o cycles-broken -s25-100 -r mpn_hgcd_appr mpn_hgcd_appr_lehmer

(where mpn_hgcd_appr_lehmer is the same function, but with the threshold
set to oo). I get something like 

25        0.000021918       #0.9804
26       #0.000020254        1.0175
27        0.000024062       #0.9956
28        0.000023719       #0.9872
29        0.000026115       #0.9971
30       #0.000026410        1.0039
31        0.000029227       #0.9943
32       #0.000027456        1.0476
33       #0.000029724        1.0354
34        0.000033176       #0.9378
35        0.000034543       #0.9936
36        0.000034690       #0.9491
37       #0.000037370        1.0042
38       #0.000037307        1.0092
39       #0.000039375        1.0244
40       #0.000040528        1.0117
41       #0.000041027        1.0265
42        0.000045572       #0.9753
43       #0.000044023        1.0768
44        0.000045668       #0.9963
45       #0.000047275        1.0406
46       #0.000049069        1.0149
47       #0.000053588        1.0192
48        0.000054195       #0.9471
49       #0.000056460        1.0118
50       #0.000056100        1.0346
51       #0.000059280        1.0131
52       #0.000058770        1.0466
53       #0.000063103        1.0373
54       #0.000063480        1.0062
55       #0.000065240        1.0332
56       #0.000064426        1.0037
57       #0.000069758        1.0178
58        0.000070777       #0.9878
59       #0.000074418        1.0078
60       #0.000073808        1.0239
61       #0.000077287        1.0124
62        0.000079399       #0.9662
63        0.000084664       #0.9817
64        0.000085752       #0.9756
65        0.000087833       #0.9932
66        0.000088593       #0.9793
67        0.000092872       #0.9784
68        0.000091987       #0.9981
69        0.000098925       #0.9764
70        0.000099583       #0.9597
71        0.000101520       #0.9867
72        0.000101021       #0.9679
73        0.000108534       #0.9441
74        0.000108132       #0.9512
75        0.000109502       #0.9967
76        0.000109638       #0.9915
77        0.000114415       #0.9836
78       #0.000110997        1.0116
79        0.000118131       #0.9697
80        0.000118676       #0.9825
81        0.000124098       #0.9617
82        0.000122920       #0.9966
83        0.000126791       #0.9833
84        0.000128606       #0.9782
85        0.000134989       #0.9752
86        0.000132904       #0.9922
87        0.000136080       #0.9926
88        0.000134670       #0.9931
89        0.000142504       #0.9736
90        0.000141022       #0.9938
91        0.000146522       #0.9940
92       #0.000146383        1.0051
93       #0.000152076        1.0052
94        0.000151755       #0.9782
95       #0.000152659        1.0243
96        0.000153986       #0.9935
97        0.000159660       #0.9930
98       #0.000159568        1.0079
99        0.000165184       #0.9948
100      #0.000164181        1.0217

So we get a range just above the threshold, 37-61, where the dc
algorithm wins most of the time, by up to \approx 5% (and that's what
tuneup saw, I guess), followed by a range (62-91) where the quadratic
algorithm wins most of the time, also by up to 5%, followed by a new
crossover region.

> I suspect Barrett's algorithm could supplant plain schoolbook division
> for tiny operands, by using well-written mulhi and mullo.

There's nothing similar here. Both methods get down to the same hgcd2
calls to produce single-limb matrices. The quadratic algorithm applies
these one at a time to reduce the inputs and build up the output matrix
(dropping low limbs of the inputs as they become uninteresting), while
the dc algorithm builds two matrices of half size and multiplies them
together (and also multiplies the first of them with the inputs).

And the hgcd input size needs to be above 4 times TOOM22_THRESHOLD (15
here) for those multiplications to be subquadratic, or above 4 times
MATRIX22_STRASSEN_THRESHOLD (14 here) for that trick to pay off.

I really don't understand why we appear to have disjoint ranges where dc
wins.

And it may be the same with mpn_hgcd (not _appr).

Any suggestions? Should we implement disjoint ranges? Or, if we use a
single threshold, where should we put it?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

