mpz_set_mem() question

niXman's Github github.nixman at
Fri Feb 23 14:00:35 CET 2024


im working on an idea to a speed up parsing of input strings for mpz/mpq/mpf.

in the process of studying the implementation of those functions I paid attention to the most obvious reasons slowing down parsing:
0) calls to `strlen()`/`strchr()` etc functions.
1) memory allocations and copying into, just to add the final '\0' char. (mpq_set_str)
2) preliminary memory allocations and copying into char-by-char just to full processing the input string. (mpz_set_str/mpf_set_str)

let me explain with the simplest example on the `mpz_set_str()` function:
please look at this line:

after we have discarded leading zeros and spaces and determined the `base` (if the user has not specified it) - we call `strlen()` to calculate the string length and then to allocate memory that we will use in this ( loop to put the same number of (transformed) bytes into allocated memory.

since this library is for working with big numbers, I assume that input strings can be millions, and even billions of characters long!

in my opinion, it is not good to use `strlen()` in this case, as well as allocating memory the size of the source string, and then looping through the source string character-by-character just to transform each char using `__gmp_digit_value_tab` table, because this can not be done!

firstly - in most cases when this function is used, a user already knows the length of the string.
no matter how this line was received - read from a file or from a socket, the user read data from the source until a delimiter character or a terminating '\0' was encountered.
this allows us to get rid of the `strlen()` call by providing an additional arg to the function.

secondly - let's follow the path of using an intermediate byte array.
0) here ( we call the `mpn_set_str()` function which we pass our intermediate byte array and its size.
1) inside this function, if `base` is a power of two, we iterate through our temporary array from the end:
2) it is in this line ( that we can transform the original string character by character using `__gmp_digit_value_tab` table and without using a temporary array!
3) in the case when `base` is not a power of two - we go either to `mpn_bc_set_str()` or we generate a power table and go to `mpn_dc_set_str()`.
(I have some thoughts about power table generation before calling `mpn_set_str()` function in order to generate it only once in the case when user understands exactly why he needs it and how it works)

this way we save memory - because we don't use a temporary array for transformated input string, and time - because we don't iterate over the string and the array twice.

a similar problem exists in the `mpf_set_str()` and `mpq_set_str()` functions.

so I have a few questions:
- since everything described is only in my head, I would like to know whether such a contribution will be accepted?

if yes, then I will take on this task.
- I propose to add a functions with a similar name to the public API, that is: mpz_set_str2/mpq_set_str2/mpf_set_str2/mpn_set_str2 .
(or suggest a suffix yourself)

- could someone provide a link to the contribution guide? (yes, I found, but it doesn’t have what I need)

- unimportant now, but im wondering if libgmp provides a way to initialize the really-big-number by updating it piecemeal, for situations where the source string are really huge, several gigabytes in size? ie smth like `loop { read(fd, buf); mpz_update_str(buf); }`

perhaps any thoughts or questions?

with Best regards from niXman!

More information about the gmp-discuss mailing list