special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede
Sorry if I am spamming or not using the right list, but I think I might be
onto a regression in current devel.

Namely, special (non-ASCII) characters with latin1 encoding are not
printed as glyphs with R 3.5.0 devel, but they were with R 3.4.1.

This output is from

# R version 3.4.1 (2017-06-30) -- "Single Candle"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰") # Euro, en-dash, promille
> # v3.4.1 prints latin1 characters fine
> print(x)
[1] "€" "–" "‰"

And this (and all following) output is from

# R Under development (unstable) (2017-07-30 r73000) -- "Unsuffered Consequences"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰") # Euro, en-dash, promille
> # printed as escapes with 3.5.0 devel
> print(x)
[1] "\u0080" "\u0096" "\u0089"

The possible regression ends here, all following output is the same with
v.3.4.1 and 3.5.0 devel.

Possibly a second, but IMHO related issue is that encoding to UTF-8 does
not help and that information is lost when encoding back to latin1.

First, chars are printed as escapes as well, when converted to UTF-8, which
is unexpected, considering that escapes can be printed as glyphs (see
below).

> Encoding(x)
[1] "latin1" "latin1" "latin1"
> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> print(x_utf8)
[1] "\u0080" "\u0096" "\u0089"

Converting back to native is lossy (which, to me, is also unexpected).

# When converting x_utf8 back to native encoding, chars are not marked as
# latin-1 ...
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown" "unknown"
> print(x_nat)
[1] "<U+0080>" "<U+0096>" "<U+0089>"

Other unicode chars print fine as glyphs when entered as escapes (cf.
enc2utf8(x) above)

> z <- c("\u215B", "\u2105", "\u03B7") # 1/8, c/o, eta
> Encoding(z)
[1] "UTF-8" "UTF-8" "UTF-8"
> print(z)
[1] "⅛" "℅" "η"

But changing encoding is also not such a good idea here.

> z_nat <- enc2native(z)
> Encoding(z_nat)
[1] "unknown" "unknown" "unknown"
> z_utf8 <- enc2utf8(z_nat)
> Encoding(z_utf8)
[1] "unknown" "unknown" "unknown"
> print(z_utf8)
[1] "<U+215B>" "<U+2105>" "<U+03B7>"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede
Upon further inspection, I think there are at least two problems.
First, the issue with printing latin1/cp1252 characters in the "80" to "9F"
code range.

x <- c("€", "–", "‰")
Encoding(x)
print(x)

I assume that these are Unicode escapes!? (Given that Encoding(x) shows
"latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
Now I don't know why print tries to convert to Unicode, but if these indeed
are Unicode escapes, then there is something wrong with the conversion from
cp1252 to Unicode.
In general, most cp1252 char codes translate to Unicode like CP1252: "00"
-> Unicode "0000", "01" -> "0001", "02" -> "0002", etc. see
http://www.cp1252.com/.
The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
"80" in cp1252 but "20AC" in Unicode, endash "96" in cp1252, "2013" in
Unicode.
The same error seems to happen with

enc2utf8(x)

Now with iconv() the result is as expected.

iconv(x, to = "UTF-8")
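
The mapping difference described above can be checked independently of the session locale. The following is a minimal sketch, assuming the platform's iconv knows the encoding names "CP1252" and "ISO-8859-1"; the helper `code_point` is a hypothetical name introduced only for this illustration.

```r
# Build the single-byte strings from raw bytes so that the example does not
# depend on the source file encoding or on the session locale.
bytes <- c(euro = 0x80, en_dash = 0x96, per_mille = 0x89)
strings <- vapply(bytes, function(b) rawToChar(as.raw(b)), character(1))

# Hypothetical helper: decode one byte string under the given source
# encoding and return the resulting Unicode code point.
code_point <- function(s, from) {
  utf8ToInt(iconv(s, from = from, to = "UTF-8"))
}

as_cp1252 <- vapply(strings, code_point, integer(1), from = "CP1252")
as_latin1 <- vapply(strings, code_point, integer(1), from = "ISO-8859-1")

# Side-by-side view of the two interpretations of the same bytes.
data.frame(
  byte   = sprintf("0x%02X", bytes),
  cp1252 = sprintf("U+%04X", as_cp1252),
  latin1 = sprintf("U+%04X", as_latin1),
  row.names = names(bytes)
)
```

Read as CP1252, the three bytes decode to U+20AC, U+2013 and U+2030; read as ISO 8859-1, they decode to the control characters U+0080, U+0096 and U+0089, which matches the escapes printed by R-devel.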


The second problem IMO is that encoding markers get lost with the enc2*
functions

x_utf8 <- enc2utf8(x)
Encoding(x_utf8)
x_nat <- enc2native(x_utf8)
Encoding(x_nat)

Again, this is not the case with iconv()

x_iutf8 <- iconv(x, to = "UTF-8")
Encoding(x_iutf8)
x_inat <- iconv(x_iutf8, from = "UTF-8")
Encoding(x_inat)

Re: special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede
Sorry, I should have included my console output, obviously. So here we go:

Wrong Unicode escapes when using print() in v3.5.0 devel:

# R Under development (unstable) (2017-07-30 r73000) -- "Unsuffered Consequences"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰")
> Encoding(x)
[1] "latin1" "latin1" "latin1"
> print(x)
[1] "\u0080" "\u0096" "\u0089"

Same output with enc2utf8()

> enc2utf8(x)
[1] "\u0080" "\u0096" "\u0089"

With iconv() the result is as expected.

> iconv(x, to = "UTF-8")
[1] "€" "–" "‰"

The second problem IMO is that encoding markers get lost with the enc2*
functions

> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown" "unknown"

This is not the case with iconv()

> x_iutf8 <- iconv(x, to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1" "latin1"

Re: special latin1 do not print as glyphs in current devel on windows

Prof Brian Ripley
In reply to this post by Daniel Possenriede
You seem confused about Latin-1: those characters are not in Latin-1.
(Microsoft code pages are a proprietary encoding, some code pages such
as CP1252 being extensions to Latin-1.)

You have not given the 'at a minimum information' asked for in the
posting guide so we have no way to reproduce this, and without showing
us the output on your system, we have no idea what you saw.

[As a convenience to Windows users, R does in some cases assume that
they are using Latin-1 encodings. If they use extensions to Latin-1 then
there are no guarantees that code written for strict Latin-1 will work.]

On 01/08/2017 10:19, Daniel Possenriede wrote:

> Upon further inspection, I think these are at least two problems.
> First the issue with printing latin1/cp1252 characters in the "80" to "9F"
> code range.
>
> x <- c("€", "–", "‰")
> Encoding(x)
> print(x)
>
> I assume that these are Unicode escapes!? (Given that Encoding(x) shows
> "latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
> e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
> Now I don't know why print tries to convert to Unicode, but if these indeed
> are Unicode escapes, then there is something wrong with the conversion from
> cp1252 to Unicode.
> In general, most cp1252 char codes translate to Unicode like CP1252: "00"
> -> Unicode "0000", "01" -> "0001", "02" -> "0002", etc. see
> http://www.cp1252.com/.
> The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
> "80" in cp1252 but "20AC" in Unicode, endash "96" in cp1252, "2013" in
> Unicode.
> The same error seems to happen with
>
> enc2utf8(x)
>
> Now with iconv() the result is as expected.
>
> iconv(x, to = "UTF-8")
>
>
> The second problem IMO is that encoding markers get lost with the enc2*
> functions

As you are changing encodings, you do not want to preserve encoding!

> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)

In an actual Latin-1 locale on Linux

 > x_utf8 <- c("éè", "\u20ac", "\u2013")
 > Encoding(x_utf8)
[1] "latin1" "UTF-8"  "UTF-8"
 > enc2native(x_utf8)
[1] "éè"     "<U+20AC>" "<U+2013>"
 > Encoding(.Last.value)
[1] "latin1"  "unknown" "unknown"

as expected.
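
Why the Euro sign cannot survive a conversion to a strict Latin-1 native encoding can be sketched with explicit iconv() calls. This is an illustration only, assuming an iconv implementation that knows "ISO-8859-1" and "CP1252" and that reports unconvertible input as NA (the default when `sub` is not given).

```r
euro    <- intToUtf8(0x20AC)  # the Euro sign, always marked as UTF-8
e_acute <- intToUtf8(0xE9)    # e with acute accent, also marked as UTF-8

# Strict Latin-1 has no Euro sign, so the conversion is expected to fail
# and return NA.
iconv(euro, from = "UTF-8", to = "ISO-8859-1")

# The accented e exists in Latin-1 and becomes the single byte 0xE9.
charToRaw(iconv(e_acute, from = "UTF-8", to = "ISO-8859-1"))

# CP1252 does contain the Euro sign, at byte 0x80.
charToRaw(iconv(euro, from = "UTF-8", to = "CP1252"))
```

The same character is therefore representable in a CP1252 locale but not in a true Latin-1 locale, which is why the two platforms behave differently.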

> Again, this is not the case with iconv()
>
> x_iutf8 <- iconv(x, to = "UTF-8")
> Encoding(x_iutf8)
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)

iconv is converting from/to the current locale's encoding, presumably
CP1252, not from the marked encoding (as the help page states explicitly.)
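
That iconv() works from its `from` argument rather than from the declared encoding can be seen in a small sketch. It assumes only base R; the variable names are chosen for this illustration.

```r
# A u-umlaut as its two UTF-8 bytes (0xC3 0xBC), explicitly marked as UTF-8.
u_uml <- rawToChar(as.raw(c(0xC3, 0xBC)))
Encoding(u_uml) <- "UTF-8"

# iconv() trusts `from`, not the mark: the two UTF-8 bytes are read as two
# Latin-1 characters and each is re-encoded, giving four bytes.
twice <- iconv(u_uml, from = "ISO-8859-1", to = "UTF-8")
charToRaw(twice)

# With the matching `from`, the bytes pass through unchanged.
once <- iconv(u_uml, from = "UTF-8", to = "UTF-8")
charToRaw(once)
```

When `from` is omitted, the session's native encoding takes its place, so the outcome of the iconv() calls quoted above depends on the locale and not on what Encoding() reports.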

--
Brian D. Ripley,                  [hidden email]
Emeritus Professor of Applied Statistics, University of Oxford

Re: special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede
Thank you! My apologies again for not including the console output in my
earlier message. I have sent another e-mail with the output in the meantime, so
it should be a bit clearer now what I am seeing. In case I missed
something, please let me know.

Yes, I am using latin1 and cp1252 interchangeably here, mostly because
Encoding() is reporting the encoding as "latin1". You presumed correctly
that my current/default locale's encoding is CP1252. (I also mentioned
before that my locale is LC_COLLATE=German_Germany.1252.)


> As you are changing encodings, you do not want to preserve encoding!

I am not interested in preserving encodings. What I am worried about is
that the encoding is not marked anymore, i.e. that Encoding() returns
"unknown".
In cp1252 encoding on Windows (note that I am using the cp1252 escape
"\x80" and not the Unicode "\u20AC")

> x_utf8 <- enc2utf8(c("€", "\x80"))
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown"

See also Kirill's message to this list: "ASCII strings are marked as ASCII
internally, but this information doesn't seem to be available, e.g.,
Encoding() returns "unknown" for such strings "
http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-tp4733523.html
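
For context on what "unknown" means here, a minimal sketch using only base R: "unknown" is what Encoding() reports for any string without a declared encoding, and pure ASCII strings never carry one.

```r
ascii <- "abc"
Encoding(ascii)              # "unknown"
Encoding(enc2utf8(ascii))    # "unknown": ASCII is never marked

# Attempting to declare an encoding on an ASCII string has no effect.
Encoding(ascii) <- "UTF-8"
Encoding(ascii)              # "unknown"

# A non-ASCII string built from a code point does carry a mark.
non_ascii <- intToUtf8(0x20AC)
Encoding(non_ascii)          # "UTF-8"
```

An "unknown" result therefore does not by itself show that a conversion went wrong; it only says that the bytes will be interpreted in the native encoding.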

>
> Again, this is not the case with iconv()
>>
>> x_iutf8 <- iconv(x, to = "UTF-8")
>> Encoding(x_iutf8)
>> x_inat <- iconv(x_iutf8, from = "UTF-8")
>> Encoding(x_inat)
>>
>
> iconv is converting from/to the current locale's encoding, presumably
> CP1252, not from the marked encoding (as the help page states explicitly.)
>

I am aware that iconv is not using the marked encoding, but that you either
have to set it explicitly or it uses the current locale's default encoding.
As I said I am worried about the fact that the encoding markers get lost
with the enc2* functions or rather they are not set correctly. I am just
using the iconv example to show that iconv is able to set the encoding
markers correctly. So it seems generally possible.

> x_iutf8 <- iconv(c("€", "\x80"), to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8"
> x_iutf8
[1] "€" "€"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1"
> x_inat
[1] "\u0080" "\u0080"

Re: special latin1 do not print as glyphs in current devel on windows

Patrick Perry-2
Regarding the Windows character encoding issues Daniel Possenriede
posted about earlier this month, where non-Latin-1 strings were getting
marked as such
(https://stat.ethz.ch/pipermail/r-devel/2017-August/074731.html ):

The issue is that on Windows, when the character locale is Windows-1252,
R marks some (possibly all) native non-ASCII strings as "latin1". I
posted a related bug report:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 . The bug
report also includes a link to a fix for a related issue: converting
strings from Windows native to UTF-8.

There is a work-around for this bug in the current development version
of the 'corpus' package (not on CRAN yet). See
https://github.com/patperry/r-corpus/issues/5 . I have tested this on a
Windows-1252 install of R, but I have not tested it on a Windows install
in another locale. It'd be great if someone with such an install would
test the fix and report back, either here or on the github issue.


Patrick

Re: special latin1 do not print as glyphs in current devel on windows

Daniel Possenriede
This is a follow-up on my initial posts regarding character encodings on
Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html)
and Patrick Perry's reply
(https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
particular (thank you for the links and the bug report!). My initial
posts were quite chaotic (and partly wrong), so I am trying to clear
things up a bit.

Actually, the title of my original message "special latin1 [characters]
do not print as glyphs in current devel on windows" is already wrong,
because the problem exists with characters with CP1252 encoding in the
80-9F (hex) range. Like Brian Ripley rightfully pointed out, latin1 !=
CP1252. The characters in the 80-9F code point range are not even part
of ISO/IEC 8859-1 a.k.a. latin1, see for example
https://en.wikipedia.org/wiki/Windows-1252. R treats them as if they
were, however, and that is exactly the problem, IMHO.

Let me show you what I mean. (All output from R 3.5 r73238, see
sessionInfo at the end)

 > Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
 > x <- c("€", "ž", "š", "ü")
 > sapply(x, charToRaw)
\u0080 \u009e \u009a  ü
80 9e 9a fc

"€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also
show the "ü" just as an example of a non-ASCII character outside that
range (and because Patrick Perry used it in his bug report which might
be a (slightly) different problem, but I will get to that later.)

 > print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for
example, should be \u20ac, not \u0080.
(In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes.
Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in C,
translateCharUTF8?).)

 > print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

 > Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is
incorrect for "€", "ž", and "š", which simply do not exist in latin1.

As per the documentation, "enc2utf8 convert[s] elements of character
vectors to [...] UTF-8 [...], taking any marked encoding into account."
Since the marked encoding is wrong, so is the output of enc2utf8().

 > enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown" everything works fine.

 > x_un <- x
 > Encoding(x_un) <- "unknown"
 > print(x_un)
[1] "€" "ž" "š" "ü"
 > (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be
marked as "latin1" on CP1252 locales, IMHO.
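
The manual fix above (dropping the declared encoding before converting) can be made independent of the session locale by naming the source encoding explicitly. The following is a sketch only: `cp1252_to_utf8` is a hypothetical helper, not part of base R, and it assumes that the bytes really are CP1252 and that the platform's iconv knows that name.

```r
# Hypothetical helper (not part of base R): treat the bytes of `x` as CP1252
# regardless of any declared encoding and return UTF-8 strings.
cp1252_to_utf8 <- function(x) {
  Encoding(x) <- "unknown"   # drop a possibly wrong "latin1" mark
  iconv(x, from = "CP1252", to = "UTF-8")
}

# Simulate the situation from the thread: CP1252 bytes marked as "latin1".
x <- vapply(c(0x80, 0x9E, 0x9A, 0xFC),
            function(b) rawToChar(as.raw(b)), character(1))
Encoding(x) <- "latin1"

y <- cp1252_to_utf8(x)
Encoding(y)
sprintf("U+%04X", vapply(y, utf8ToInt, integer(1)))
```

With this input the code points come out as U+20AC, U+017E, U+0161 and U+00FC, that is, the Euro sign, "ž", "š" and "ü".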

As a side-note: the output of localeToCharset() is also problematic,
since ISO8859-1 != CP1252.

 > localeToCharset()
[1] "ISO8859-1"

Finally on to Patrick Perry's bug report
(https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be
seen above. Probably, because the bug applies to the C locale (sorry if
this is somewhere apparent in the bug report and I missed it).

 > Sys.setlocale("LC_CTYPE", "C")
[1] "C"
 > enc2utf8("ü")
[1] "|"
 > charToRaw("ü")
[1] fc
 > Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string,
so it seems to me that this is a different problem than the one above.

Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

 > sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

Re: special latin1 do not print as glyphs in current devel on windows

Patrick Perry-2
This particular issue has a simple fix. Currently, the "R_check_locale"
function includes the following code starting at line 244 in
src/main/platform.c:

#ifdef Win32
     {
     char *ctype = setlocale(LC_CTYPE, NULL), *p;
     p = strrchr(ctype, '.');
     if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
     /* Not 100% correct, but CP1252 is a superset */
     known_to_be_latin1 = latin1locale = (localeCP == 1252);
     }
#endif

The "1252" should be "28591"; see
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx 
.
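
To make the effect of the proposed change concrete, the code-page parsing from the C snippet can be transcribed into R. This is an illustrative sketch, not the actual implementation; the function names `locale_codepage`, `is_latin1_current` and `is_latin1_proposed` are hypothetical.

```r
# R transcription of the parsing logic: take the text after the last "."
# and read its leading digits as the code page, or return 0 if there are none.
locale_codepage <- function(ctype) {
  if (!grepl(".", ctype, fixed = TRUE)) return(0L)
  suffix <- sub("^.*\\.", "", ctype)   # text after the last "."
  digits <- regmatches(suffix, regexpr("^[0-9]+", suffix))
  if (length(digits) == 0L) 0L else as.integer(digits)
}

# Current test (code page 1252) versus the proposed one (28591, which is
# Microsoft's identifier for ISO 8859-1).
is_latin1_current  <- function(ctype) locale_codepage(ctype) == 1252L
is_latin1_proposed <- function(ctype) locale_codepage(ctype) == 28591L

ctype <- "German_Germany.1252"
locale_codepage(ctype)      # 1252
is_latin1_current(ctype)    # TRUE
is_latin1_proposed(ctype)   # FALSE
locale_codepage("C")        # 0
```

Under the proposed comparison a Windows-1252 session would no longer be treated as a Latin-1 locale, so its native strings would not be marked as "latin1".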

Re: special latin1 do not print as glyphs in current devel on windows

Patrick Perry-2
Just following up on this since the associated bug report was just
closed (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 )
because my original bug report was incomplete and did not include
sessionInfo() or LC_CTYPE.

Admittedly, my original bug report was a little confused. I have since
gained a better understanding of the issue. I want to (a) confirm that this
is a real bug in base R, not RStudio, and (b) provide more context. It
looks like the real issue is that R marks native strings as "latin1"
when the declared character locale is Windows-1252. This causes problems
when converting to UTF-8. See Daniel Possenriede's email of September 14,
earlier in this thread, for much more detail, including his sessionInfo()
and a reproducible example.

The development version of the `stringi` package and the CRAN version of
the `utf8` package both have workarounds for this bug. (See, e.g.,
https://github.com/gagolews/stringi/issues/287 and the links to the
related issues.)


Patrick