mpn_mul is embarrassingly slow

Fri Apr 20 09:36:41 UTC 2018

On Fri, 20 Apr 2018, Vincent Lefevre wrote:

> On 2018-04-20 04:14:15 +0200, Fredrik Johansson wrote:
>> For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
>> twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 2x2
>> multiplication is even faster. The problem is that there are three function
>> calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
>> user code and GMP's lightning fast assembly code.
>>
>> I was reminded of this old issue when seeing this new paper on arXiv:
>> https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
>> implementation of bignum arithmetic against mpn_mul for small operand sizes
>> and came to the conclusion that the former approach performs better than
>> hand-optimized assembly (one wishes that compilers really were that clever
>> about bignum code by now!).
>>
>> Some advanced GMP users (including myself) know about the issue and simply
>> avoid mpn_mul for performance-critical code with short operands. The most
>> convenient solution is to call mpn_mul_basecase directly instead of
>> mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
>> iffy to rely on. One feature request would be to simply make
>> mpn_mul_basecase / mpn_sqr_basecase public.
> [...]
>
> I'm wondering... With the current GMP code, does LTO help to avoid
> such issues?

mpn_mul and mpn_mul_n are too large to be completely inlined (unless 
that's the only place where they are used, which could happen in a 
microtest, but doesn't seem realistic in an application). What could 
happen is partial inlining of the first test of each. Maybe using LTO+PGO 
(profile-guided optimization)? Still, I am not particularly optimistic.

-- 
Marc Glisse