mpn_mul is embarrassingly slow

Fri Apr 20 19:12:42 UTC 2018

Interesting.  I see that this paper compares to NTL as well.
I spent the morning seeing what I could do to improve the situation
for NTL, whose mul routine has essentially the same functionality
as mpz_mul (takes care of memory allocation and signs).
I reduced some overheads and for small inputs just called this function:

   static inline limb_t
   base_mul (_ntl_limb_t* rp, const limb_t* up, long un, const limb_t* vp, long vn)
   {
     rp[un] = mpn_mul_1 (rp, up, un, vp[0]);

     while (--vn >= 1)
       {
         rp += 1, vp += 1;
         rp[un] = mpn_addmul_1 (rp, up, un, vp[0]);
       }
     return rp[un];
   }

At least this has the virtue of using only public GMP functions,
and avoids one level of function call.

This gave me a bit of a speedup speed up (1.35-1.40x speedup for 2-3 limbs).
Beyond 4 limbs, I don't get much of an improvement at all, so I 
cut it off at 4.
I also special case single limb inputs (getting a 2.5x speedup for those).

Within my mul routine, I also detect if the two inputs point to the
same bigint.  If so, I directly call man_sqr, rather than going through
mpn_mul and letting it make that test (but I also special case one limb
inputs and inline that).
This also speeds things up nicely for small inputs (up to 4 limbs).
Looking at the current code for mpn_sqr, it looks like it branches
immediately to mpn_sqr_basecase for small inputs.

Anyway, that's my 

> On Apr 19, 2018, at 10:14 PM, Fredrik Johansson <fredrik.johansson at gmail.com> wrote:
> 
> For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
> twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 2x2
> multiplication is even faster. The problem is that there are three function
> calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
> user code and GMP's lightning fast assembly code.
> 
> I was reminded of this old issue when seeing this new paper on arXiv:
> https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
> implementation of bignum arithmetic against mpn_mul for small operand sizes
> and came to the conclusion that the former approach performs better than
> hand-optimized assembly (one wishes that compilers really were that clever
> about bignum code by now!).
> 
> Some advanced GMP users (including myself) know about the issue and simply
> avoid mpn_mul for performance-critical code with short operands. The most
> convenient solution is to call mpn_mul_basecase directly instead of
> mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
> iffy to rely on. One feature request would be to simply make
> mpn_mul_basecase / mpn_sqr_basecase public.
> 
> It should also be possible to speed up mpn_mul. A possible optimization
> would be to call mpn_mul_basecase directly in mpn_mul without taking the
> detour through mpn_mul_n for small balanced multiplications. A more
> aggressive optimization would be to change the architecture of mpn_mul so
> that assembly code is reached directly. I believe at least some of the
> assembly implementations of mpn_mul_basecase already do something like this:
> 
> case 1x1: ...
> case 2x1: ...
> case 2x2: ...
> generic case: (basecase loop)
> 
> It would be possible to have mpn_mul itself assembly-coded to do something
> like this:
> 
> case 1x1: ...
> case 2x1: ...
> case 2x2: ...
> generic case, small n: (basecase loop)
> generic case, large n: (fall back to calling an mpn_mul_generic function
> that selects between different algorithms)
> 
> An even more ambitious change would be to provide some kind of interface in
> GMP for arithmetic with minimal overhead when the number of limbs is a
> small compile-time constant. For example, such an interface could consist
> of macros of the type mpn_op_nxm for op = add/sub, mul and addmul/submul
> and different small (n,m), Of course, such macros could simply call the
> normal mpn functions as a generic fallback if an optimized assembly block
> for that size is not available. Another possibility for an interface would
> be to detect compile-time constants in the function arguments to the usual
> mpn functions (with the help of macros), but that might be too much magic...
> 
> Best,
> Fredrik
> _______________________________________________
> gmp-devel mailing list
> gmp-devel at gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel