mpn_mul is embarrassingly slow

Marc Glisse marc.glisse at inria.fr
Fri Apr 20 19:25:27 UTC 2018


On Fri, 20 Apr 2018, Marc Glisse wrote:

> On Fri, 20 Apr 2018, Marc Glisse wrote:
>
>> On Fri, 20 Apr 2018, Marc Glisse wrote:
>> 
>>> On Fri, 20 Apr 2018, Vincent Lefevre wrote:
>>> 
>>>> On 2018-04-20 04:14:15 +0200, Fredrik Johansson wrote:
>>>>> For operands with 1-4 limbs, that is; on my machine, mpn_mul takes
>>>>> up to twice as long as mpn_mul_basecase, and inline assembly for
>>>>> 1x1, 2x1 or 2x2 multiplication is even faster. The problem is that
>>>>> there are three function calls (mpn_mul -> mpn_mul_n ->
>>>>> mpn_mul_basecase) + branches between the user code and GMP's
>>>>> lightning fast assembly code.
>>>>> 
>>>>> I was reminded of this old issue when seeing this new paper on
>>>>> arXiv: https://arxiv.org/abs/1804.07236. Here, the author
>>>>> benchmarked a C++ implementation of bignum arithmetic against
>>>>> mpn_mul for small operand sizes and came to the conclusion that the
>>>>> former approach performs better than hand-optimized assembly (one
>>>>> wishes that compilers really were that clever about bignum code by
>>>>> now!).
>>>>> 
>>>>> Some advanced GMP users (including myself) know about the issue and
>>>>> simply avoid mpn_mul for performance-critical code with short
>>>>> operands. The most convenient solution is to call mpn_mul_basecase
>>>>> directly instead of mpn_mul. Unfortunately, mpn_mul_basecase is not
>>>>> public, so this is a bit iffy to rely on. One feature request would
>>>>> be to simply make mpn_mul_basecase / mpn_sqr_basecase public.
>>>> [...]
>>>> 
>>>> I'm wondering... With the current GMP code, does LTO help to avoid
>>>> such issues?
>>> 
>>> mpn_mul and mpn_mul_n are too large to be completely inlined (unless 
>>> that's the only place where they are used, which could happen in a 
>>> microtest, but doesn't seem realistic in an application). What could 
>>> happen is partial inlining of the first test of each. Maybe using LTO+PGO 
>>> (profile-guided optimization)? Still, I am not particularly optimistic.
>> 
>> I just tried (LTO+PGO) on a trivial testcase, and gcc didn't manage to do 
>> anything clever with it. Doing it by hand to see how much potential gain 
>> there is, the timings are:
>> 
>> mpn_mul: .56
>> mpn_mul_n: .36
>> mpn_mul_basecase: .16
>
> Let me mention here that LLVM does manage to take advantage of LTO to get the 
> running time of mpn_mul down to .16 for this trivial example (only call 
> mpn_mul, only on 1 limb numbers).

> With constant propagation, can't GCC deduce that a whole lot of this
> code is dead, or at least go straight to mpn_mul_basecase (p, a, n, b,
> n), since the test BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD) is true?

I took a closer look at what gcc is doing. Propagating constants (the
sizes) from one function into another (mpn_mul) means either inlining
(unlikely: the function is very large and has several callers, including
itself) or cloning, i.e. creating a specialized copy that takes fewer
arguments and assumes the missing ones are those constants.

I can see in the logs:

Evaluating opportunities for __gmpn_mul/68.
  - considering value &r for param #0 mp_limb_t * (caller_count: 1)
      good_cloning_opportunity_p (time: 2, size: 698, count_sum: 100000000, scc, single_call) -> evaluation: 0, threshold: 500
      good_cloning_opportunity_p (time: 264, size: 59564, count_sum: 100000000, scc, single_call) -> evaluation: 1, threshold: 500
  - considering value &a for param #1 const mp_limb_t * (caller_count: 1)
      good_cloning_opportunity_p (time: 2, size: 698, count_sum: 100000000, scc, single_call) -> evaluation: 0, threshold: 500
      good_cloning_opportunity_p (time: 50, size: 12795, count_sum: 100000000, scc, single_call) -> evaluation: 0, threshold: 500
  - considering value 1 for param #2 mp_size_t (caller_count: 1)
      good_cloning_opportunity_p (time: 75, size: 689, count_sum: 100000000, scc, single_call) -> evaluation: 54, threshold: 500
      good_cloning_opportunity_p (time: 721, size: 4091, count_sum: 100000000, scc, single_call) -> evaluation: 89, threshold: 500
  - considering value &b for param #3 const mp_limb_t * (caller_count: 3)
      good_cloning_opportunity_p (time: 1, size: 699, count_sum: 100000000, scc, single_call) -> evaluation: 0, threshold: 500
      good_cloning_opportunity_p (time: 49, size: 12796, count_sum: 100000000, scc, single_call) -> evaluation: 0, threshold: 500
  - considering value 1 for param #4 mp_size_t (caller_count: 3)
      good_cloning_opportunity_p (time: 72, size: 123, count_sum: 100000000, scc, single_call) -> evaluation: 298, threshold: 500
      good_cloning_opportunity_p (time: 72, size: 123, count_sum: 100000000, scc, single_call) -> evaluation: 298, threshold: 500

So gcc did consider this optimization, but its heuristics mistakenly
convinced it that it wasn't worth the trouble. On simpler examples not
using GMP, I can get gcc to do the transformation; it would take quite
some work to analyse why gcc is not more enthusiastic here.

I tried again with --param ipa-cp-eval-threshold=50 to convince gcc to
clone... and the running time went down to .29. It did clone mpn_mul and
mpn_mul_n, and the clones are a couple of lines each, but gcc failed to
inline those afterwards! Curiously, it also failed to eliminate the
now-dead functions like mpn_mul_fft...

(note that PGO is needed, otherwise gcc will give a score of 0 to every 
cloning opportunity)
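For the record, the experiment's build looks roughly like this (a sketch only: bench.c and the static-library path are placeholders, and the GMP objects themselves must have been compiled with -flto for cross-module propagation to see them):

```shell
# Pass 1: instrumented build, then a training run to generate profile data.
gcc -O2 -flto -fprofile-generate bench.c .libs/libgmp.a -o bench
./bench
# Pass 2: rebuild with the profile, lowering the IPA-CP threshold so gcc
# considers cloning mpn_mul / mpn_mul_n for the constant sizes.
gcc -O2 -flto -fprofile-use --param ipa-cp-eval-threshold=50 \
    bench.c .libs/libgmp.a -o bench
```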

-- 
Marc Glisse


More information about the gmp-devel mailing list