Performance mpz_divexact / mpz_divexact_ui
Simon Sobisch
simonsobisch at gnu.org
Sun Jan 16 19:55:49 CET 2022
While bench-marking an application I was presented with a tight loop
that does division by 10 until this isn't possible any more.
Scenario:
To reduce the number of iterations done in the necessary loop, the code
works with multiple loops using powers of 10. The biggest one uses the
following code, with the numbers of calls and cpu time annotated
according to callgrind:
while ((scale - 18) >= min_exp
&& mpz_divisible_ui_p (value, 1000000000000000000UL)
// 35% within the function call above, done 20071593 times
&& mpz_cmpabs (value, max_value) > 0) {
mpz_divexact_ui (value, value, 1000000000000000000UL);
// 54% within the function call above, done 2059434 times
scale -= 18;
}
The loops that are following in the code handle the scales 13,9,6,3 -
they are are also called often but not as often as the first one (in
this benchmark) and take less then 0.01% cpu-time altogether.
Things tested to decrease the run-time:
As there is a pre-computed mpz_t with all the constants used in the
program already, I've did a test replacing the _ui functions with the
mpz_t ones:
while ((scale - 18) >= min_exp
&& mpz_divisible_p (value, p_mpze10[18])
&& mpz_cmpabs (value, max_value) > 0) {
mpz_divexact (value, value, p_mpze10[18]);
scale -= 18;
}
and ... the total time of the benchmark increased remarkably.
This is the main point of this mail, as noted in the questions below.
To heavily reduce the numbers of the loop I've added another loop before
that uses a scale of 34 (which obviously overflows the unsigned long and
therefore an mpz_t is needed in this case).
As expected it heavily reduces the number of iterations in all paths,
but as not expected it increases the total time a lot.
Main questions:
* Is a performance drop expected when using the mpz_t variants of those
two functions?
If 'yes' a note "prefer the _ui-variant for better performance" in the
docs under "Division Functions" would be useful, similar to the
mpz_divexact note which is in there.
* Is there an expected difference about this time increase when generic/
optimized versions of libgmp are used - or is the performance decrease
expected to happen in both cases (I guess so)?
If 'no' then this may be a useful addition to the docs, too.
Side questions:
* Would you prefer 1000000000000000000UL or 10UL ^ 18 in the code?
I assume _all_ compilers would pre-compute the value, but I'm not
sure if this depends on optimization options / flags.
* Do you have any other suggestions on improving the code shown above?
Thanks for taking the time to answer,
Simon
More information about the gmp-discuss
mailing list