What's a reasonable size ratio for toom32?
Niels Möller
nisse at lysator.liu.se
Thu Aug 24 21:06:20 CEST 2023
Paul Zimmermann <Paul.Zimmermann at inria.fr> writes:
> Dear Niels,
>
> ./speed -p 1000000 -c -s 10-200 -f1.1 mpn_mul.0.6 would be more readable,
> although the change in speed.h would be larger.
That would be nicer. Ideally, one could use some way to parametrize both
sizes.
Anyway, I tried some benchmarking, on shell, using the simple hack.
Speed of mpn_mul_n appear unchanged (expected, since then all
subproducts are perfectly balanced, so I figured it might make sense to
measure times in relation to mpn_mul_n.
Before the toom22 and toom32 changes:
$ ./speed-before -p 100000 -c -s 2-50 -f 1.1 -r mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
2 25.35 #0.9639 0.9650 0.9651 0.9651 0.9651
3 53.46 #0.4773 0.4778 0.6400 0.6399 0.6399
4 68.32 0.5514 #0.5513 0.5537 0.9129 0.9129
5 95.08 #0.4501 0.7394 0.7394 0.8541 0.8540
6 120.21 0.6426 #0.6426 0.7513 0.7517 0.9146
7 158.69 #0.5305 0.6281 0.6281 0.7679 0.8712
8 196.26 0.5497 #0.5479 0.6679 0.7656 0.9037
9 247.84 #0.4791 0.5918 0.6784 0.8045 0.8902
10 301.55 #0.5264 0.6173 0.7226 0.8082 0.9199
11 358.73 #0.4765 0.5533 0.6517 0.7289 0.8301
12 410.29 #0.5157 0.6070 0.6783 0.7696 0.8426
13 489.66 #0.4637 0.5524 0.7048 0.7688 0.8569
14 566.06 #0.5116 0.5748 0.6551 0.7953 0.8601
15 645.85 #0.4751 0.6098 0.6694 0.7994 0.8739
16 712.23 #0.5065 0.5746 0.6984 0.7558 0.8786
17 814.82 #0.4712 0.5884 0.6537 0.7705 0.8875
18 912.66 #0.5096 0.5582 0.6688 0.7803 0.8907
19 1016.32 #0.4793 0.5834 0.6890 0.7925 0.8979
20 1052.75 #0.5241 0.6330 0.7306 0.8359 0.9406
22 1240.61 #0.5480 0.6460 0.7451 0.8428 0.9548
24 1408.13 #0.5577 0.6504 0.7434 0.8871 0.9590
26 1663.78 #0.5625 0.6484 0.7736 0.9174 0.9595
28 1897.37 #0.5592 0.6375 0.7624 0.9014 0.9436
30 2179.16 #0.5671 0.6777 0.8173 0.8539 0.9376
33 2569.17 #0.5558 0.6637 0.7667 0.8724 0.9450
36 2972.70 #0.5841 0.6830 0.7767 0.8368 0.9353
39 3389.81 #0.6078 0.6968 0.7666 0.8819 0.9416
42 3834.29 0.7139 #0.6918 0.7620 0.8818 0.9348
46 4333.80 #0.7084 0.7221 0.7607 0.8906 0.9739
50 5053.64 #0.6932 0.7102 0.7972 0.8792 0.9830
It makes sense that speed ratios roughly match the size ration in the
middle of the table where basecase is a large part of the work, and
increase with larer sizes. After the changes, I get these values, with
some annotations for notable differences.
$ ./speed -p 100000 -c -s 2-50 -f 1.1 -r mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
mpn_mul_n mpn_mul.-50 mpn_mul.-60 mpn_mul.-70 mpn_mul.-80 mpn_mul.-90
2 25.25 0.9655 0.9650 0.9657 #0.9649 0.9650
3 53.46 0.4774 #0.4773 0.6409 0.6400 0.6406
4 68.32 0.5525 0.5580 #0.5524 0.9130 0.9129
5 95.08 #0.4500 0.7395 0.7396 0.8541 0.8540
6 119.99 #0.6437 0.6437 0.7524 0.7517 0.9163
7 158.37 #0.5316 0.6294 0.6297 0.7695 0.8700
8 193.54 #0.5591 0.5593 0.6772 0.7832 0.9133 (slightly worse, noise?)
9 248.64 #0.4791 0.5914 0.6752 0.8064 0.8861
10 300.28 #0.5272 0.6210 0.7265 0.8118 0.9237
11 358.70 #0.4764 0.5527 0.6545 0.7301 0.8321
12 410.21 #0.5161 0.6066 0.6785 0.7684 0.8442
13 488.42 #0.4631 0.5571 0.7069 0.7694 0.8590
14 564.68 #0.5137 0.5756 0.6560 0.7985 0.8645
15 688.37 #0.4472 0.5722 0.6247 0.7481 0.8184 (speedup!)
16 708.63 #0.5086 0.5779 0.7025 0.7608 0.8847 (slightly worse)
17 813.77 #0.4720 0.5888 0.6568 0.7724 0.8898
18 910.40 #0.5103 0.5615 0.6707 0.7826 0.8909
19 1015.09 #0.4790 0.5843 0.6898 0.7950 0.8985
20 1052.33 #0.5252 0.6300 0.7323 0.8351 0.9415
22 1238.14 #0.5495 0.6478 0.7463 0.8450 0.9571
24 1400.00 #0.5614 0.6545 0.7487 0.8922 0.9403
26 1665.52 #0.5619 0.6477 0.7741 0.8799 (!) 0.9345 (!) speedup
28 1897.93 #0.5600 0.6374 0.7621 0.8764 (!) 0.9372 speedup
30 2170.58 #0.5688 0.6802 0.8074 0.8744 (!) 0.9352 slowdown
33 2568.48 #0.5558 0.6644 0.7597 0.8814 0.9342 slight speedup
36 2970.97 #0.5847 0.6706 0.7705 0.8285 0.9327
39 3392.28 #0.6087 0.6868 0.7619 0.8864 0.9334
42 3830.21 0.8785 (!) #0.6857 0.7569 0.8734 0.9303 Slowdown for 0.5!
46 4345.72 0.7127 #0.7092 0.7514 0.8743 0.9665
50 5062.55 #0.6920 0.7027 0.7868 0.8891 0.9602
Nice speedup for some sices, in particular 15. Some notable regressions,
in particular size 30x24 and 42x21, if I read the table correctly.
To understand what's going on, one might need to separate the rather
small changes to algorithm selection, to the reduced toom32 overhead.
But overall, rather small changes.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel
mailing list