mpn_mul is embarrassingly slow

Fri Apr 20 02:14:15 UTC 2018

For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 2x2
multiplication is even faster. The problem is that there are three function
calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
user code and GMP's lightning fast assembly code.

I was reminded of this old issue when seeing this new paper on arXiv:
https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
implementation of bignum arithmetic against mpn_mul for small operand sizes
and came to the conclusion that the former approach performs better than
hand-optimized assembly (one wishes that compilers really were that clever
about bignum code by now!).

Some advanced GMP users (including myself) know about the issue and simply
avoid mpn_mul for performance-critical code with short operands. The most
convenient solution is to call mpn_mul_basecase directly instead of
mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
iffy to rely on. One feature request would be to simply make
mpn_mul_basecase / mpn_sqr_basecase public.

It should also be possible to speed up mpn_mul. A possible optimization
would be to call mpn_mul_basecase directly in mpn_mul without taking the
detour through mpn_mul_n for small balanced multiplications. A more
aggressive optimization would be to change the architecture of mpn_mul so
that assembly code is reached directly. I believe at least some of the
assembly implementations of mpn_mul_basecase already do something like this:

case 1x1: ...
case 2x1: ...
case 2x2: ...
generic case: (basecase loop)

It would be possible to have mpn_mul itself assembly-coded to do something
like this:

case 1x1: ...
case 2x1: ...
case 2x2: ...
generic case, small n: (basecase loop)
generic case, large n: (fall back to calling an mpn_mul_generic function
that selects between different algorithms)

An even more ambitious change would be to provide some kind of interface in
GMP for arithmetic with minimal overhead when the number of limbs is a
small compile-time constant. For example, such an interface could consist
of macros of the type mpn_op_nxm for op = add/sub, mul and addmul/submul
and different small (n,m), Of course, such macros could simply call the
normal mpn functions as a generic fallback if an optimized assembly block
for that size is not available. Another possibility for an interface would
be to detect compile-time constants in the function arguments to the usual
mpn functions (with the help of macros), but that might be too much magic...

Best,
Fredrik