mpn_mul is embarrassingly slow

Fri Apr 20 08:56:41 UTC 2018

On 2018-04-20 04:14:15 +0200, Fredrik Johansson wrote:
> For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
> twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 2x2
> multiplication is even faster. The problem is that there are three function
> calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
> user code and GMP's lightning fast assembly code.
> 
> I was reminded of this old issue when seeing this new paper on arXiv:
> https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
> implementation of bignum arithmetic against mpn_mul for small operand sizes
> and came to the conclusion that the former approach performs better than
> hand-optimized assembly (one wishes that compilers really were that clever
> about bignum code by now!).
> 
> Some advanced GMP users (including myself) know about the issue and simply
> avoid mpn_mul for performance-critical code with short operands. The most
> convenient solution is to call mpn_mul_basecase directly instead of
> mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
> iffy to rely on. One feature request would be to simply make
> mpn_mul_basecase / mpn_sqr_basecase public.
[...]

I'm wondering... With the current GMP code, does LTO help to avoid
such issues?

-- 
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)