Performance mpz_divexact / mpz_divexact_ui

Sun Jan 16 19:55:49 CET 2022

While bench-marking an application I was presented with a tight loop 
that does division by 10 until this isn't possible any more.

Scenario:

To reduce the number of iterations done in the necessary loop, the code 
works with multiple loops using powers of 10. The biggest one uses the 
following code, with the numbers of calls and cpu time annotated 
according to callgrind:

   while ((scale - 18) >= min_exp
     && mpz_divisible_ui_p (value, 1000000000000000000UL)
     // 35% within the function call above, done 20071593 times
     && mpz_cmpabs (value, max_value) > 0) {
  	mpz_divexact_ui (value, value, 1000000000000000000UL);
         // 54% within the function call above, done 2059434 times
	scale -= 18;
   }

The loops that are following in the code handle the scales 13,9,6,3 - 
they are are also called often but not as often as the first one (in 
this benchmark) and take less then 0.01% cpu-time altogether.

Things tested to decrease the run-time:

As there is a pre-computed mpz_t with all the constants used in the 
program already, I've did a test replacing the _ui functions with the 
mpz_t ones:

   while ((scale - 18) >= min_exp
     && mpz_divisible_p (value, p_mpze10[18])
     && mpz_cmpabs (value, max_value) > 0) {
  	mpz_divexact (value, value, p_mpze10[18]);
	scale -= 18;
   }

and ... the total time of the benchmark increased remarkably.
This is the main point of this mail, as noted in the questions below.

To heavily reduce the numbers of the loop I've added another loop before 
that uses a scale of 34 (which obviously overflows the unsigned long and 
therefore an mpz_t is needed in this case).
As expected it heavily reduces the number of iterations in all paths, 
but as not expected it increases the total time a lot.

Main questions:

* Is a performance drop expected when using the mpz_t variants of those
   two functions?
   If 'yes' a note "prefer the _ui-variant for better performance" in the
   docs under "Division Functions" would be useful, similar to the
   mpz_divexact note which is in there.

* Is there an expected difference about this time increase when generic/
   optimized versions of libgmp are used - or is the performance decrease
   expected to happen in both cases (I guess so)?
   If 'no' then this may be a useful addition to the docs, too.

Side questions:

* Would you prefer 1000000000000000000UL or 10UL ^ 18 in the code?
   I assume _all_ compilers would pre-compute the value, but I'm not
   sure if this depends on optimization options / flags.
* Do you have any other suggestions on improving the code shown above?

Thanks for taking the time to answer,
Simon