Encoding issues

Iñaki Ucar
Hi,

We found a behaviour that seems strange (to our eyes) and might be a
bug. First, a little bit of context. The 'units' package allows us to
set the unit using either standard evaluation (SE) or non-standard
evaluation (NSE). E.g., these both work in the same way:

units::set_units(1:10, "μm")
#> Units: [μm]
#> [1]  1  2  3  4  5  6  7  8  9 10

units::set_units(1:10, μm)
#> Units: [μm]
#> [1]  1  2  3  4  5  6  7  8  9 10

That's micrometers, and works fine if the session charset is UTF-8.
Now the funny part comes with Windows. The first version, with quotes,
works fine, but the second one fails. This is easy to demonstrate from
Linux:

LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, "μm")'
#> Units: [μm]
#> [1]  1  2  3  4  5  6  7  8  9 10

LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, μm)'
#> Error: unexpected input in "units::set_units(1:10, μ"
#> Execution halted

However, if you use the first version, with quotes, in an example, and
the package is checked on Windows, it fails too (see
https://ci.appveyor.com/project/edzer/units/builds/22440023#L747). The
package declares UTF-8 encoding, so none of these errors should, in
principle, happen. Am I wrong?

Thanks in advance, regards,
Iñaki

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Encoding issues

Gábor Csárdi
From "Writing R Extensions":

"Only ASCII characters (and the control characters tab, formfeed, LF
and CR) should be used in code files."

So I am afraid you cannot use μm.

Gabor
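
The ASCII-only rule quoted above can be checked mechanically. The following is a minimal sketch in base R, not something taken from "Writing R Extensions" or from the 'units' package: `has_non_ascii` and `code_lines` are hypothetical names introduced for illustration, and the Greek mu is assumed to be U+03BC.

```r
# Hypothetical helper (not part of base R or the 'units' package):
# returns TRUE for each line that contains a character outside ASCII.
has_non_ascii <- function(lines) {
  vapply(
    enc2utf8(lines),
    function(line) any(utf8ToInt(line) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Two sample lines of R code. The second one contains a Greek mu,
# written here as a \u escape so that this snippet itself stays ASCII.
code_lines <- c(
  'label <- "micrometre"',
  'label <- "\u03bcm"'
)

# One logical value per line; TRUE marks a line that breaks the rule.
flagged <- has_non_ascii(code_lines)
```

For whole files, the 'tools' package that ships with R also provides `showNonASCIIfile()`, which prints the offending lines.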

Re: Encoding issues

Iñaki Ucar
On Mon, 18 Feb 2019 at 17:27, Gábor Csárdi <[hidden email]> wrote:
>
> From "Writing R Extensions":
>
> "Only ASCII characters (and the control characters tab, formfeed, LF
> and CR) should be used in code files."
>
> So I am afraid you cannot use μm.

Thanks, Gábor, I missed that bit. Then, is an .Rd file considered a
"code file"? Our surprise comes from the fact that the quoted version
works fine in a test file, but not in an example. Anyway, if non-ASCII
characters cause such documented trouble, it seems that the safest
option is to avoid their use in the first place.

Iñaki


Re: Encoding issues

Tomas Kalibera
On 2/18/19 4:36 PM, Iñaki Ucar wrote:

> [...]
> The package declares UTF-8 encoding, so none of these errors should,
> in principle, happen. Am I wrong?

Hi Iñaki,

If you want to report a bug against R, please try to provide a minimal
reproducible example that only uses base packages (not units), and
please also see "Writing R Extensions" (WRE) sections 1.3 and 1.6.3,
including:

"There is a portable way to have arbitrary text in character strings
(only) in your R code, which is to supply them in Unicode as ‘\uxxxx’
escapes."

"If your package specifies an encoding in its DESCRIPTION file, you
should run these tools in a locale which makes use of that encoding"
(this includes R CMD check)

Even though there are portable ways to write a UTF-8 string literal in
source code that is not representable in the current native encoding
(e.g. using \u escapes), that does not mean such a string can be used
freely in R. Many operations require conversion to the current native
encoding, which will cause an error or an unexpected result. Such
conversions can happen at any time (except where they are documented
not to happen).
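
A minimal sketch of the \u escape approach in base R, assuming the unit symbol is the Greek small letter mu (U+03BC). The variable names are illustrative, and the outcome of the final conversion may differ between platforms and iconv implementations:

```r
# Greek small letter mu followed by "m", written with a \u escape so
# that the source file itself contains only ASCII characters.
micro_m <- "\u03bcm"

# Inspect the string without relying on the session's native encoding.
code_points <- utf8ToInt(micro_m)           # Unicode code points
n_chars <- nchar(micro_m, type = "chars")   # number of characters
n_bytes <- nchar(micro_m, type = "bytes")   # number of bytes in UTF-8

# Converting to an encoding that lacks the character cannot succeed.
# By default iconv() signals a failed conversion by returning NA,
# although the exact behaviour can vary by platform.
as_ascii <- iconv(micro_m, from = "UTF-8", to = "ASCII")
```

The last line mimics, in an explicit form, the kind of conversion that R may perform implicitly when a string has to be translated to the native encoding.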

Implementing an API that will work with such strings in a package would
be hard to get right, but not impossible. NSE will not work
(non-representable strings, which are not string constant literals, are
not supported). One can save a lot of headaches by using only ASCII in
function APIs.

Best
Tomas


Re: Encoding issues

Gábor Csárdi
On Mon, Feb 18, 2019 at 4:42 PM Iñaki Ucar <[hidden email]> wrote:

> Thanks, Gábor, I missed that bit. Then, is an .Rd file considered a
> "code file"? Our surprise comes from the fact that the quoted version
> works fine in a test file, but not in an example. Anyway, if non-ASCII
> characters cause such documented trouble, it seems that the safest
> option is to avoid their use in the first place.

I don't think an Rd file is considered a code file, but you might have
problems there as well, as I believe that Rd files are manipulated in
the local encoding.

Gabor
