mpn_mul is embarrassingly slow

Fri Apr 20 16:07:19 UTC 2018

On Fri, 20 Apr 2018, Marc Glisse wrote:

> On Fri, 20 Apr 2018, Marc Glisse wrote:
>
>> On Fri, 20 Apr 2018, Vincent Lefevre wrote:
>> 
>>> On 2018-04-20 04:14:15 +0200, Fredrik Johansson wrote:
>>>> For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
>>>> twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 
>>>> 2x2
>>>> multiplication is even faster. The problem is that there are three 
>>>> function
>>>> calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
>>>> user code and GMP's lightning fast assembly code.
>>>> 
>>>> I was reminded of this old issue when seeing this new paper on arXiv:
>>>> https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
>>>> implementation of bignum arithmetic against mpn_mul for small operand 
>>>> sizes
>>>> and came to the conclusion that the former approach performs better than
>>>> hand-optimized assembly (one wishes that compilers really were that 
>>>> clever
>>>> about bignum code by now!).
>>>> 
>>>> Some advanced GMP users (including myself) know about the issue and 
>>>> simply
>>>> avoid mpn_mul for performance-critical code with short operands. The most
>>>> convenient solution is to call mpn_mul_basecase directly instead of
>>>> mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
>>>> iffy to rely on. One feature request would be to simply make
>>>> mpn_mul_basecase / mpn_sqr_basecase public.
>>> [...]
>>> 
>>> I'm wondering... With the current GMP code, does LTO help to avoid
>>> such issues?
>> 
>> mpn_mul and mpn_mul_n are too large to be completely inlined (unless that's 
>> the only place where they are used, which could happen in a microtest, but 
>> doesn't seem realistic in an application). What could happen is partial 
>> inlining of the first test of each. Maybe using LTO+PGO (profile-guided 
>> optimization)? Still, I am not particularly optimistic.
>
> I just tried (LTO+PGO) on a trivial testcase, and gcc didn't manage to do 
> anything clever with it. Doing it by hand to see how much potential gain 
> there is, the timings are:
>
> mpn_mul: .56
> mpn_mul_n: .36
> mpn_mul_basecase: .16

Let me mention here that LLVM does manage to take advantage of LTO to get 
the running time of mpn_mul down to .16 for this trivial example (only 
call mpn_mul, only on 1 limb numbers).

-- 
Marc Glisse