Support underscores for mpz/mpq assignments from strings

Fri Jun 11 16:20:01 UTC 2021

> On 11 Jun 2021, at 17:06, Vincent Lefevre <vincent at vinc17.net> wrote:
> 
> On 2021-06-11 15:02:11 +0200, Hans Åberg wrote:
>>>> The current international standard is to use as decimal separator either a period '.' or a comma ',', and as number separator spaces ' '.
>>>> https://en.wikipedia.org/wiki/Decimal_separator#Current_standards
>>> 
>>> But note that this is mainly for output (for humans), not to read back
>>> values.
>> 
>> Humans can copy and paste. If I do that with the numbers below that
>> use a space as digit separator, and paste into the calculator app,
>> then it works on MacOS, but not iOS.
> 
> Given the fact that numbers can come from anywhere and do not
> necessarily follow *this* particular international standard,
> the issue is that this is ambiguous: What does "12,345" mean?
> 12345 (assuming that "," is a thousands separator) or 12.345
> (assuming that it is the decimal separator)?

The former is a likely use in the US despite NIST recommending the international standard.

> A function that can handle numbers obtained from copy-paste must
> take into account much more than what this standard specifies.

The conventions also differ between programming languages.

>>> This means that the library would have to parse and copy the value
>>> for GMP, something already done by GMP: see
>>> 
>>> /* Remove spaces from the string and convert the result from ASCII to a
>>>    byte array.  */
>>> 
>>> in mpz/set_str.c. This is a bit of a waste. IMHO, a GMP function that
>>> accepts a byte array would be better for use by special libraries, with
>>> their own parsing rule. Or advise to use mpn_set_str in such cases?
>> 
>> This is a C standard (discarding spaces) that is also present in C++, so changing it may break some programs.
>> https://en.cppreference.com/w/cpp/string/basic_string/stoul
> 
> No, only initial spaces are discarded: "Discards any whitespace
> characters (as identified by calling isspace()) until the first
> non-whitespace character is found[...]".
> 
> Once a non-whitespace character is found, spaces are no longer
> allowed.

OK.
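
To make the difference concrete, here is a minimal sketch using only the standard C library. The helper name consumed_by_strtoul is made up for illustration. It reports how far strtoul gets: leading white space is skipped, but parsing stops at the first embedded space. mpz_set_str, by contrast, is documented to ignore white space anywhere in the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse `s` in base 10 with strtoul, store the
   value in *value, and return how many characters were consumed
   (including any leading white space that strtoul skipped). */
size_t consumed_by_strtoul(const char *s, unsigned long *value)
{
    char *end;

    assert(s != NULL && value != NULL);
    *value = strtoul(s, &end, 10);
    return (size_t)(end - s);
}
```

With the input "  12 345" this stops after "12", so the embedded space ends the number rather than being discarded.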

>> When using a lexer program like Flex, one typically matches the
>> whole number string, and then passes it onto a function like
>> mpz_set_str (as opposed to computing the number value in the lexer).
>> Doing these translations is probably not time critical: a parser
>> typically spends most time in the actions and lexer, and less in the
>> parser part.
> 
> It would still be better to avoid the overhead.

It would complicate the code, increasing the risk of errors, which is worse than the overhead.

>> So to facilitate that, you might have a special function that
>> indicates which characters should be discarded, and the decimal
>> separators. The international standard mentioned above would require
>> the latter to be a string, like ",.". Setting the former to " "
>> would be the C/C++ behavior, "" would not discard anything.
> 
> Note that " " would not be the C/C++ behavior (see above).
> And depending on applications, prefixes may or may not be allowed
> (in particular, the 0 prefix for octal supported by mpz_set_str
> is dangerous).

Then add another string for initial space to ignore: Unicode supports all kinds of spaces.
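
A rough sketch of what such a pre-filter might look like, in plain C. The name filter_number and its parameters are hypothetical, not anything in GMP. It works byte by byte, so it only covers single-byte separators; Unicode spaces would need a proper decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GMP: copy `in` to `out`, first
   skipping leading bytes listed in `lead`, then dropping every byte
   listed in `discard`.  Returns 0 on success, -1 if `out` (of size
   `outsz`) is too small to hold the result and its terminator. */
int filter_number(const char *in, const char *lead, const char *discard,
                  char *out, size_t outsz)
{
    size_t n = 0;

    assert(in != NULL && lead != NULL && discard != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    /* Initial characters to ignore, e.g. " \t". */
    while (*in != '\0' && strchr(lead, *in) != NULL)
        in++;

    /* Digit separators to ignore, e.g. " _"; "" discards nothing. */
    for (; *in != '\0'; in++) {
        if (strchr(discard, *in) != NULL)
            continue;
        if (n + 1 >= outsz)
            return -1;
        out[n++] = *in;
    }
    out[n] = '\0';
    return 0;
}
```

The filtered buffer could then be handed to mpz_set_str (z, buf, 10) unchanged, so the caller chooses the separator rules and GMP only ever sees plain digits.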

But you probably cannot change mpz_set_str, because it is modelled on C behavior.
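
The 0 prefix mentioned above is one instance of that C heritage. The sketch below uses strtoul as a stand-in (the wrapper name parse_ul is made up): with base 0 a leading "0" selects octal, and the GMP manual documents the same prefix rule for mpz_set_str when base is 0. Passing an explicit base 10 avoids the surprise.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper: parse `s` with strtoul in the given base.
   Base 0 lets the prefix choose: "0x" is hex, a leading "0" is octal,
   anything else is decimal. */
unsigned long parse_ul(const char *s, int base)
{
    assert(s != NULL);
    return strtoul(s, NULL, base);
}
```

So "010" read with base 0 is eight, while the same string read with base 10 is ten.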