What's a reasonable size ratio for toom32?

Thu Aug 24 21:06:20 CEST 2023

Paul Zimmermann <Paul.Zimmermann at inria.fr> writes:

>        Dear Niels,
>
> ./speed -p 1000000 -c -s 10-200 -f1.1 mpn_mul.0.6 would be more readable,
> although the change in speed.h would be larger.

That would be nicer. Ideally, one could use some way to parametrize both
sizes.

Anyway, I tried some benchmarking, on shell, using the simple hack.
Speed of mpn_mul_n appear unchanged (expected, since then all
subproducts are perfectly balanced, so I figured it might make sense to
measure times in relation to mpn_mul_n.

Before the toom22 and toom32 changes:

$ ./speed-before -p 100000 -c -s 2-50 -f 1.1 -r mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
            mpn_mul_n   mpn_mul.-50   mpn_mul.-60   mpn_mul.-70   mpn_mul.-80   mpn_mul.-90
2               25.35       #0.9639        0.9650        0.9651        0.9651        0.9651
3               53.46       #0.4773        0.4778        0.6400        0.6399        0.6399
4               68.32        0.5514       #0.5513        0.5537        0.9129        0.9129
5               95.08       #0.4501        0.7394        0.7394        0.8541        0.8540
6              120.21        0.6426       #0.6426        0.7513        0.7517        0.9146
7              158.69       #0.5305        0.6281        0.6281        0.7679        0.8712
8              196.26        0.5497       #0.5479        0.6679        0.7656        0.9037
9              247.84       #0.4791        0.5918        0.6784        0.8045        0.8902
10             301.55       #0.5264        0.6173        0.7226        0.8082        0.9199
11             358.73       #0.4765        0.5533        0.6517        0.7289        0.8301
12             410.29       #0.5157        0.6070        0.6783        0.7696        0.8426
13             489.66       #0.4637        0.5524        0.7048        0.7688        0.8569
14             566.06       #0.5116        0.5748        0.6551        0.7953        0.8601
15             645.85       #0.4751        0.6098        0.6694        0.7994        0.8739
16             712.23       #0.5065        0.5746        0.6984        0.7558        0.8786
17             814.82       #0.4712        0.5884        0.6537        0.7705        0.8875
18             912.66       #0.5096        0.5582        0.6688        0.7803        0.8907
19            1016.32       #0.4793        0.5834        0.6890        0.7925        0.8979
20            1052.75       #0.5241        0.6330        0.7306        0.8359        0.9406
22            1240.61       #0.5480        0.6460        0.7451        0.8428        0.9548
24            1408.13       #0.5577        0.6504        0.7434        0.8871        0.9590
26            1663.78       #0.5625        0.6484        0.7736        0.9174        0.9595
28            1897.37       #0.5592        0.6375        0.7624        0.9014        0.9436
30            2179.16       #0.5671        0.6777        0.8173        0.8539        0.9376
33            2569.17       #0.5558        0.6637        0.7667        0.8724        0.9450
36            2972.70       #0.5841        0.6830        0.7767        0.8368        0.9353
39            3389.81       #0.6078        0.6968        0.7666        0.8819        0.9416
42            3834.29        0.7139       #0.6918        0.7620        0.8818        0.9348
46            4333.80       #0.7084        0.7221        0.7607        0.8906        0.9739
50            5053.64       #0.6932        0.7102        0.7972        0.8792        0.9830

It makes sense that speed ratios roughly match the size ration in the
middle of the table where basecase is a large part of the work, and
increase with larer sizes. After the changes, I get these values, with
some annotations for notable differences.

$ ./speed -p 100000 -c -s 2-50 -f 1.1 -r mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
            mpn_mul_n   mpn_mul.-50   mpn_mul.-60   mpn_mul.-70   mpn_mul.-80   mpn_mul.-90
2               25.25        0.9655        0.9650        0.9657       #0.9649        0.9650
3               53.46        0.4774       #0.4773        0.6409        0.6400        0.6406
4               68.32        0.5525        0.5580       #0.5524        0.9130        0.9129
5               95.08       #0.4500        0.7395        0.7396        0.8541        0.8540
6              119.99       #0.6437        0.6437        0.7524        0.7517        0.9163
7              158.37       #0.5316        0.6294        0.6297        0.7695        0.8700
8              193.54       #0.5591        0.5593        0.6772        0.7832        0.9133 (slightly worse, noise?)
9              248.64       #0.4791        0.5914        0.6752        0.8064        0.8861
10             300.28       #0.5272        0.6210        0.7265        0.8118        0.9237
11             358.70       #0.4764        0.5527        0.6545        0.7301        0.8321
12             410.21       #0.5161        0.6066        0.6785        0.7684        0.8442
13             488.42       #0.4631        0.5571        0.7069        0.7694        0.8590
14             564.68       #0.5137        0.5756        0.6560        0.7985        0.8645
15             688.37       #0.4472        0.5722        0.6247        0.7481        0.8184 (speedup!)
16             708.63       #0.5086        0.5779        0.7025        0.7608        0.8847 (slightly worse)
17             813.77       #0.4720        0.5888        0.6568        0.7724        0.8898
18             910.40       #0.5103        0.5615        0.6707        0.7826        0.8909
19            1015.09       #0.4790        0.5843        0.6898        0.7950        0.8985
20            1052.33       #0.5252        0.6300        0.7323        0.8351        0.9415
22            1238.14       #0.5495        0.6478        0.7463        0.8450        0.9571
24            1400.00       #0.5614        0.6545        0.7487        0.8922        0.9403
26            1665.52       #0.5619        0.6477        0.7741        0.8799 (!)    0.9345 (!) speedup
28            1897.93       #0.5600        0.6374        0.7621        0.8764 (!)    0.9372 speedup
30            2170.58       #0.5688        0.6802        0.8074        0.8744 (!)    0.9352 slowdown
33            2568.48       #0.5558        0.6644        0.7597        0.8814        0.9342 slight speedup
36            2970.97       #0.5847        0.6706        0.7705        0.8285        0.9327
39            3392.28       #0.6087        0.6868        0.7619        0.8864        0.9334
42            3830.21        0.8785 (!)   #0.6857        0.7569        0.8734        0.9303 Slowdown for 0.5!
46            4345.72        0.7127       #0.7092        0.7514        0.8743        0.9665
50            5062.55       #0.6920        0.7027        0.7868        0.8891        0.9602

Nice speedup for some sices, in particular 15. Some notable regressions,
in particular size 30x24 and 42x21, if I read the table correctly.

To understand what's going on, one might need to separate the rather
small changes to algorithm selection, to the reduced toom32 overhead.

But overall, rather small changes.

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.