What's a reasonable size ratio for toom32?

Thu Sep 14 19:54:27 CEST 2023

Niels Möller <nisse at lysator.liu.se> writes:

> Nice speedup for some sices, in particular 15. Some notable regressions,
> in particular size 30x24 and 42x21, if I read the table correctly.
>
> To understand what's going on, one might need to separate the rather
> small changes to algorithm selection, to the reduced toom32 overhead.

Since this was so noisy, I've made a new try, taking out (most of) the
mpn_mul algorithm selection from picture. So let's compare
mpn_mul_basecase and mpn_toom32_mul for a few ratios. I tried, 0.5, 0.7
and 0.8.

Before:

$ ./speed-nohack -p 100000 -c -s 2-100 -f 1.2  mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
        mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80
2               15.85           n/a         15.85           n/a        #15.85           n/a
3              #17.83           n/a         24.76           n/a         24.76           n/a
4               28.89           n/a        #28.84           n/a         53.46           n/a
5              #32.85           n/a         61.39           n/a         72.28           n/a
6              #68.32           n/a         81.19           n/a         81.19           n/a
7              #75.27           n/a         90.12           n/a        112.00           n/a
8              #98.07           n/a        122.02           n/a        142.78           n/a
9             #108.10           n/a        159.69           n/a        189.20           n/a
10            #152.87        463.13        211.76        463.23        235.97        463.86
12            #203.81        501.84        270.97        502.65        306.06        502.62
14            #283.98        614.32        364.14        615.33        445.40        614.30
16            #352.81        696.92        488.13        697.71        529.72        699.03
19            #479.76        884.33        691.06        883.58        797.68        883.27
22            #671.82       1034.74        916.42       1034.20       1037.16       1033.89
26            #930.34       1306.73       1279.71       1303.43       1422.11       1303.24
31           #1266.87       1750.73       1814.02       1749.58       2038.85       2281.88
37           #1785.48       2306.52       2519.19       2305.17       2919.34       2304.54
44           #2593.88       3037.51       3531.03       3032.28       4122.22       3035.03
52           #3600.48       4006.00       4994.00       4003.96       5688.95       4001.57
62           #5147.14       5370.00       7127.25       5365.57       8147.50       5363.33
74            7315.00       6952.75      10098.10       6953.31      11673.44      #6949.69
88           10236.73       9273.75      14214.00       9278.67      16295.00      #9265.75

$ ./speed-hack -p 100000 -c -s 2-100 -f 1.2  mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80
overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz
        mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80
2               15.85           n/a         15.84           n/a        #15.84           n/a
3              #17.82           n/a         24.76           n/a         24.76           n/a
4               28.91           n/a        #28.86           n/a         53.45           n/a
5              #32.85           n/a         61.38           n/a         72.26           n/a
6              #68.31           n/a         81.18           n/a         81.18           n/a
7              #75.25           n/a         90.10           n/a        112.05           n/a
8              #98.06           n/a        122.04           n/a        142.86           n/a
9             #108.15           n/a        159.92           n/a        189.21           n/a
10            #152.76        420.43        211.70        420.25        236.25        419.95
12            #203.69        463.06        270.84        464.07        306.00        463.74
14            #283.98        586.27        363.69        586.42        444.86        586.57
16            #352.48        673.81        488.66        673.32        529.77        673.20
19            #479.49        853.71        691.75        853.47        796.95        853.06
22            #671.75        996.61        915.84        997.88       1037.51        995.13
26            #929.26       1294.79       1279.71       1291.70       1421.53       1290.95
31           #1264.93       1717.77       1813.25       1716.57       2038.21       1717.83
37           #1785.33       2276.68       2518.91       2274.46       2918.13       2275.00
44           #2593.69       3005.06       3531.81       3002.97       4121.48       3005.35
52           #3599.65       3977.39       4993.23       3979.29       5687.50       3978.75
62           #5149.77       5268.05       7125.44       5262.10       8149.14       5267.90
74           #7313.47       8835.50      10096.90       8835.88      11668.33       8835.88
88           10232.73      #9199.17      14206.37       9199.42      16287.14      11831.92

It looks like the new toom32 is generally faster than the old (except
for the larger sizes where we should be using higher toom, and then the
old code wins because it recurses to mpn_mul which can call one of those
functions), which is nice. But another observation that I think is
interesting, is that it has a really hard time beating mul_basecase at
ratio 0.5. So we probably should never call toom32 close to this ratio
(on shell, it seems toom42 starts beating basecase around size 55
limbs).

Which brings me somewhat back to the drawing board, where the idea was
that the unbalanced subproduct in toom22 should always recurse to one of
toom22, toom32 or schoolbook. Maybe we need some other alternative.
Could consider toom42, but my gut feeling is that's no use for smallish
sizes (although we don't tune toom32 vs toom42, so I don't know where
the threshold might be).

Perhaps we should have a 2:1 algorithm which splits the larger number in
two, and multiplies each part with the smaller one using toom22. What
would be a name for that? I'm having some difficulty with the intuition,
perhaps it would help to draw the diagrams of ratio for unbalanced sub
product as a function of the input ratio, for both toom22 and toom32, as
well as the new candidate algorithm.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.