mpn_mul is embarrassingly slow

Wed Apr 25 18:10:10 UTC 2018

tg at gmplib.org (Torbjörn Granlund) writes:

> nisse at lysator.liu.se (Niels Möller) writes:
>
>   For n x 20, n pretty large, what strategy does mpn_mul use? I would
>   expect repeated toom32, but maybe that gives too much overhed. It seems
>   we don't have any MUL_TOOM32_TO_BASECASE threshold?
>
> I believe this is the logic triggered in this case:
>
> 	  if (4 * un < 5 * vn)
> 	    mpn_toom22_mul (prodp, up, un, vp, vn, scratch);
> 	  else if (4 * un < 7 * vn)
> 	    mpn_toom32_mul (prodp, up, un, vp, vn, scratch);
> 	  else
> 	    mpn_toom42_mul (prodp, up, un, vp, vn, scratch);

On my machine (coreibwl), MUL_TOOM22_THRESHOLD is 27.

If we benchmark mpn_mul_basecase against mpn_toom32_mul sing speed, the
relevant crossover is where the ratio is close to 2/3 (since the size
argument for speed measurements of toom32 is the size of the larger
operand). I get

$ ./tune/speed -p 10000 -r -s 20-50 mpn_mul_basecase mpn_toom32_mul
overhead 0.000000005 secs, precision 10000 units of 1.25e-09 secs, CPU
freq 798.34 MHz
        mpn_mul_basecase mpn_toom32_mul
[...]
28        0.000001875       #0.7651
29        0.000002008       #0.7434
30        0.000002140       #0.7094
31        0.000002262       #0.7525
32        0.000002452       #0.7030
33        0.000002600       #0.6905
34        0.000002690       #0.7143
35        0.000002898       #0.6693
36        0.000003024       #0.6575
37        0.000003208       #0.6714
38        0.000003376       #0.6508
39        0.000003515       #0.6446
40        0.000003723       #0.6422
41        0.000003917       #0.6343
42        0.000004000       #0.6245
43        0.000004282       #0.6178
[...]

So the threshold measured this way would be around 38 limbs. Translated
to vn, the size of the smaller operand, the threshold becomes 25 limbs,
very close to the toom22 threshold.

Similarly comparing mpn_mul_basecase to mpn_toom42_mul, the relevant
crossover is where the ratio is close to 1/2. My measurements look very
flat, but threshold should be somewhere in the range 67-79 limbs for the
large operand, hence in the range 33-40 limbs in terms of the size of
the smaller operand, 20%-50% above the toom22 threshold.

So from this initial experiment, it seems it might work fine to use the
same threshold for toom22 and toom32, if applied to the size of the
smaller operand. But we should tune a separate threshold for toom42 vs
basecase, and never apply toom42 for unbalanced operands with vn below
that threshold.

Which maybe isn't that surprising: toom32 just adds +1 to the set of of
points, {0, -1, inf}, used by toom22, with little additional complexity.
While toom42 adds the interpolation point +2, which adds quite a bit of
complexity, including more shifts and one call to mpn_divexact_by3.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.