Support underscores for mpz/mpq assignments from strings

Fri Jun 11 15:06:28 UTC 2021

On 2021-06-11 15:02:11 +0200, Hans Åberg wrote:
> >> The current international standard is to use as decimal separator either a period '.' or a comma ',', and as number separator spaces ' '.
> >> https://en.wikipedia.org/wiki/Decimal_separator#Current_standards
> > 
> > But note that this is mainly for output (for humans), not to read back
> > values.
> 
> Humans can copy and paste. If I do that with the numbers below that
> use a space as digit separator, and paste into the calculator app,
> then it works on MacOS, but not iOS.

Since numbers can come from anywhere and do not necessarily
follow *this* particular international standard, such input is
ambiguous: what does "12,345" mean? 12345 (assuming that "," is
a thousands separator) or 12.345 (assuming that it is the decimal
separator)?

A function that can handle numbers obtained from copy-paste must
take into account much more than what this standard specifies.
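
To make the ambiguity concrete, here is a minimal sketch. The helper
normalize_number is hypothetical (it is not a GMP function): it drops
one caller-chosen group separator and rewrites one caller-chosen
decimal separator as '.'. The same input gives two different results
depending on which convention the caller assumes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GMP: copy src into dst, dropping
   every occurrence of group_sep and rewriting decimal_sep as '.'.
   Returns 0 on success, -1 if dst is too small. */
static int normalize_number(const char *src, char group_sep,
                            char decimal_sep, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;

    for (; *src != '\0'; src++) {
        char c = *src;

        if (c == group_sep)
            continue;               /* separator between digit groups */
        if (c == decimal_sep)
            c = '.';                /* canonical decimal point */
        if (n + 1 >= dst_size)
            return -1;              /* no room for c plus the final NUL */
        dst[n++] = c;
    }
    dst[n] = '\0';
    return 0;
}
```

With group_sep ',' and decimal_sep '.', the input "12,345" becomes
"12345"; with group_sep ' ' and decimal_sep ',', it becomes "12.345".
Nothing in the string itself says which reading is intended.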

> > This means that the library would have to parse and copy the value
> > for GMP, something already done by GMP: see
> > 
> >  /* Remove spaces from the string and convert the result from ASCII to a
> >     byte array.  */
> > 
> > in mpz/set_str.c. This is a bit of a waste. IMHO, a GMP function that
> > accepts a byte array would be better for use by special libraries, with
> > their own parsing rule. Or advise to use mpn_set_str in such cases?
> 
> This is a C standard (discarding spaces) that is also present in C++, so changing it may break some programs.
> https://en.cppreference.com/w/cpp/string/basic_string/stoul

No, only initial spaces are discarded: "Discards any whitespace
characters (as identified by calling isspace()) until the first
non-whitespace character is found[...]".

Once a non-whitespace character is found, spaces are no longer
allowed.
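
This can be checked with the C function strtoul, on which the C++
std::stoul is specified to rely. The sketch below uses a hypothetical
wrapper, consumed_by_strtoul, which reports how many characters of the
input were consumed in base 10.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper: parse s in base 10 with strtoul, store the
   value in *value, and return the number of characters consumed
   (leading whitespace included). */
static size_t consumed_by_strtoul(const char *s, unsigned long *value)
{
    char *end;

    assert(s != NULL && value != NULL);
    *value = strtoul(s, &end, 10);
    return (size_t)(end - s);
}
```

On the input "  12 345", the two leading spaces are skipped, "12" is
parsed, and the conversion stops at the embedded space: 4 characters
are consumed and the value is 12, not 12345.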

> When using a lexer program like Flex, one typically matches the
> whole number string, and then passes it onto a function like
> mpz_set_str (as opposed to computing the number value in the lexer).
> Doing these translations is probably not time-critical: a parser
> typically spends most time in the actions and lexer, and less in the
> parser part.

It would still be better to avoid the overhead.

> So to facilitate that, you might have a special function that
> indicates which characters should be discarded, and the decimal
> separators. The international standard mentioned above would require
> the latter to be a string, like ",.". Setting the former to " "
> would be the C/C++ behavior, "" would not discard anything.

Note that " " would not be the C/C++ behavior (see above).
And depending on the application, prefixes may or may not be allowed
(in particular, the 0 prefix for octal supported by mpz_set_str
is dangerous).
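
As an illustration of the octal pitfall, here is a small sketch that
uses the standard strtol as a stand-in, not GMP itself: with base 0,
strtol also detects prefixes and reads a leading 0 as octal. The two
helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: base 0 enables prefix detection, so a leading
   "0" selects octal and a leading "0x" selects hexadecimal. */
static long parse_with_prefixes(const char *s)
{
    assert(s != NULL);
    return strtol(s, NULL, 0);
}

/* Hypothetical helper: an explicit base 10 reads a leading 0 as an
   ordinary decimal digit. */
static long parse_decimal_only(const char *s)
{
    assert(s != NULL);
    return strtol(s, NULL, 10);
}
```

Here parse_with_prefixes("010") yields 8, while
parse_decimal_only("010") yields 10. Zero-padded decimal data (e.g.
from fixed-width fields) is thus silently misread when prefix
detection is enabled.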

-- 
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)