source(), parse(), and foreign UTF-8 characters


source(), parse(), and foreign UTF-8 characters

Kirill Müller
Hi


I'm having trouble sourcing or parsing a UTF-8 file that contains
characters that are not representable in the current locale ("foreign
characters") on Windows. The source() function stops with an error, and
the parse() function re-encodes all foreign characters using the <U+xxxx>
notation. I have added a reproducible example below the message.

This seems well within the bounds of documented behavior, although the
documentation for source() could mention that the file can't contain
foreign characters. Still, I'd prefer if UTF-8 "just worked" in R, and
I'm willing to invest substantial time to help with that. Before
starting to write a detailed proposal, I feel that I need a better
understanding of the problem, and I'm grateful for any feedback you
might have.

I have looked into character encodings in the context of the dplyr
package, and I have observed the following behavior:

- Strings are treated preferentially in the native encoding
- Only upon specific request (via translateCharUTF8() or enc2utf8() or
...) are they translated to UTF-8 and marked as such
- On UTF-8 systems, strings are never marked as UTF-8
- ASCII strings are marked as ASCII internally, but this information
doesn't seem to be available, e.g., Encoding() returns "unknown" for
such strings
- Most functions in R are encoding-agnostic: they work the same
regardless of whether they receive a native or a UTF-8 encoded string,
as long as it is properly tagged
- One important exception is symbols, which must be in the native
encoding (and are always converted to the native encoding, using
<U+xxxx> escapes)
- I/O is centered around the native encoding, e.g., writeLines() always
re-encodes to the native encoding
- There is the "bytes" encoding, which avoids re-encoding.
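
A minimal sketch of a few of the observations above, using only base R.
The variable names are made up for illustration, and the sketch covers
only the encoding marks visible from R code, not the internal ASCII flag:

```r
# ASCII-only strings carry no visible encoding mark.
ascii <- "happiness"
Encoding(ascii)                      # "unknown"

# enc2utf8() converts to UTF-8 and marks the result as such.
utf8 <- enc2utf8("Gl\u00fcck")
Encoding(utf8)                       # "UTF-8"

# Character count and byte count differ for non-ASCII text:
# the u-umlaut occupies two bytes in UTF-8.
nchar(utf8, type = "chars")          # 5
nchar(utf8, type = "bytes")          # 6

# Declaring a string as "bytes" tells R not to re-encode it.
as_bytes <- utf8
Encoding(as_bytes) <- "bytes"
Encoding(as_bytes)                   # "bytes"
```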

I haven't looked into serialization or plot devices yet.

The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8
narrow strings everywhere and convert them back and forth when using
platform APIs that don’t support UTF-8 ...". (It is written in the
context of the UTF-16 encoding used internally on Windows, but seems to
apply just the same here for the native encoding.) I think that Unicode
support in R could be greatly improved if we follow these guidelines.
This seems to mean:

- Convert strings to UTF-8 as soon as possible, and mark them as such
(also on systems where UTF-8 is the native encoding)
- Translate to native only upon specific request, e.g., in calls to API
functions or perhaps for .C()
- Use UTF-8 for symbols
- Avoid the forced round-trip to the native encoding in I/O functions
and for parsing (but still read/write native by default)
- Carefully look into serialization and plot devices
- Add helper functions that simplify mundane tasks such as
reading/writing a UTF-8 encoded file
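
As a rough idea of what the helper functions in the last item could look
like, here is a sketch built on base R only. The names write_utf8() and
read_utf8() are hypothetical and do not exist in R; the sketch assumes
that useBytes = TRUE and the encoding argument of readLines() are enough
to avoid the round-trip through the native encoding:

```r
# Hypothetical helpers (names are illustrative, not part of base R).

# Write a character vector as UTF-8 without re-encoding it to the
# native encoding.
write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")     # binary mode: no newline translation
  on.exit(close(con))
  writeLines(enc2utf8(text), con, useBytes = TRUE)
  invisible(path)
}

# Read a UTF-8 file and mark the strings as UTF-8 instead of
# converting them to the native encoding.
read_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

utf8_path <- tempfile(fileext = ".txt")
words <- c("Gl\u00fcck", "\u5e78\u798f")
write_utf8(words, utf8_path)
identical(read_utf8(utf8_path), words)   # TRUE
```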

I'm sure I've missed many potential pitfalls; your input is greatly
appreciated. Thanks for your attention.

Further resources: A write-up by Prof. Ripley [2], a section in R-ints
[3], a blog post by Ista Zahn [4], a StackOverflow search [5].


Best regards

Kirill



[1] http://utf8everywhere.org/#conclusions

[2] https://developer.r-project.org/Encodings_and_R.html

[3]
https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs

[4]
http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/

[5]
http://stackoverflow.com/search?tab=votes&q=%5br%5d%20encoding%20windows%20is%3aquestion



# Use one of the following:
id <- "Gl\u00fcck"
id <- "\u5e78\u798f"
id <- "\u0441\u0447\u0430\u0441\u0442\u044c\u0435"
id <- "\ud589\ubcf5"

file_contents <- paste0('"', id, '"')
Encoding(file_contents)
raw_file_contents <- charToRaw(file_contents)

path <- tempfile(fileext = ".R")
writeBin(raw_file_contents, path)
file.size(path)
length(raw_file_contents)

# Escapes the string
parse(text = file_contents)

# Throws an error
print(source(path, encoding = "UTF-8"))

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: source(), parse(), and foreign UTF-8 characters

Duncan Murdoch-2
On 09/05/2017 3:42 AM, Kirill Müller wrote:

> The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8
> narrow strings everywhere and convert them back and forth when using
> platform APIs that don’t support UTF-8 ...". (It is written in the
> context of the UTF-16 encoding used internally on Windows, but seems to
> apply just the same here for the native encoding.) I think that Unicode
> support in R could be greatly improved if we follow these guidelines.
> This seems to mean:
>
> - Convert strings to UTF-8 as soon as possible, and mark them as such
> (also on systems where UTF-8 is the native encoding)
> - Translate to native only upon specific request, e.g., in calls to API
> functions or perhaps for .C()
> - Use UTF-8 for symbols
> - Avoid the forced round-trip to the native encoding in I/O functions
> and for parsing (but still read/write native by default)
> - Carefully look into serialization and plot devices
> - Add helper functions that simplify mundane tasks such as
> reading/writing a UTF-8 encoded file

Those are good long term goals, though I think the effort is easier than
you think.  Rather than attempting to do it all at once, you should look
for ways to do it gradually and submit self-contained patches.  In many
cases it doesn't matter if strings are left in the local encoding,
because the encoding doesn't matter.  The problems arise when UTF-8
strings are converted to the local encoding before it's necessary,
because that's a lossy conversion.  So a simple way to proceed is to
identify where these conversions occur, and remove them one-by-one.
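
To make the lossy conversion concrete, here is a small sketch in base R.
It is only an illustration: "latin1" stands in for a non-UTF-8 native
encoding, and round_trip() is a hypothetical helper, not something in
R's sources:

```r
# "latin1" stands in for a non-UTF-8 native encoding (an assumption
# made for illustration only).
representable <- enc2utf8("Gl\u00fcck")     # u-umlaut exists in Latin-1
foreign       <- enc2utf8("\u5e78\u798f")   # these characters do not

# Convert to the stand-in native encoding and back again.
round_trip <- function(x) {
  native <- iconv(x, from = "UTF-8", to = "latin1")
  iconv(native, from = "latin1", to = "UTF-8")
}

# Survives the detour through the native encoding.
identical(round_trip(representable), representable)   # TRUE

# Does not survive: the characters cannot be represented, so the
# original string cannot be recovered afterwards.
identical(round_trip(foreign), foreign)               # FALSE
```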

Currently I'm working on bug 16098, "Windows doesn't handle high Unicode
code points".  It doesn't require many changes at all to handle input of
those characters; all the remaining issues are avoiding the problems you
identify above.  The origin of the issue is the fact that in Windows
wchar_t is only 16 bits (not big enough to hold all Unicode code
points).  As far as I know, Windows has no standard type to hold a
Unicode code point; most of the run-time functions still use the 16-bit
wchar_t.
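
The arithmetic behind this limit can be sketched in R. A code point
above U+FFFF does not fit into one 16-bit unit, so UTF-16 stores it as a
pair of surrogates. to_surrogates() is a hypothetical helper written
only for this illustration and is unrelated to the patch for bug 16098:

```r
# Split a code point above U+FFFF into a UTF-16 surrogate pair.
# to_surrogates() is a hypothetical helper, not part of R.
to_surrogates <- function(code_point) {
  stopifnot(code_point > 0xFFFF, code_point <= 0x10FFFF)
  offset <- as.integer(code_point - 0x10000)
  c(high = 0xD800L + bitwShiftR(offset, 10L),   # top 10 bits
    low  = 0xDC00L + bitwAnd(offset, 0x3FFL))   # bottom 10 bits
}

pair <- to_surrogates(0x1F600)   # U+1F600 lies outside the BMP
sprintf("%04X", pair)            # "D83D" "DE00"
```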

I think once that bug is dealt with, 90+% of the remaining issues could
be solved by avoiding translateChar on Windows.  This could be done by
avoiding it everywhere, or by acting as though Windows is running in a
UTF-8 locale until you actually need to write to a file.  Other systems
tend to have UTF-8 locales in common use, so they're already fine.

You offered to spend time on this.  I'd appreciate some checks of the
patch I'm developing for 16098, and also some research into how certain
things (e.g. the iswprint function) are handled on Windows.

Duncan Murdoch


Re: source(), parse(), and foreign UTF-8 characters

Kirill Müller
On 09.05.2017 13:19, Duncan Murdoch wrote:

> Those are good long term goals, though I think the effort is easier
> than you think.  Rather than attempting to do it all at once, you
> should look for ways to do it gradually and submit self-contained
> patches.  In many cases it doesn't matter if strings are left in the
> local encoding, because the encoding doesn't matter.  The problems
> arise when UTF-8 strings are converted to the local encoding before
> it's necessary, because that's a lossy conversion.  So a simple way to
> proceed is to identify where these conversions occur, and remove them
> one-by-one.
Thanks, Duncan, this looks like a good start indeed. Did you really mean
to say "the effort is easier than I think"? It would be great if I had
overestimated the effort; I seldom do. That said, I'd be grateful if you
could review/integrate/... future patches of mine towards parsing and
sourcing of UTF-8 files with foreign characters. This problem seems to
be self-contained (but perhaps not that easy).

I still think symbols should be in UTF-8, and this change might be
difficult to split into smaller changes, especially if taking into
account serialization and other potential pitfalls.

>
> Currently I'm working on bug 16098, "Windows doesn't handle high
> Unicode code points".  It doesn't require many changes at all to
> handle input of those characters; all the remaining issues are
> avoiding the problems you identify above.  The origin of the issue is
> the fact that in Windows wchar_t is only 16 bits (not big enough to
> hold all Unicode code points).  As far as I know, Windows has no
> standard type to hold a Unicode code point, most of the run-time
> functions still use the 16 bit wchar_t.
I didn't mention non-BMP characters; they are an important issue as well.

>
> I think once that bug is dealt with, 90+% of the remaining issues
> could be solved by avoiding translateChar on Windows.  This could be
> done by avoiding it everywhere, or by acting as though Windows is
> running in a UTF-8 locale until you actually need to write to a file.  
> Other systems tend to have UTF-8 locales in common use, so they're
> already fine.
I'd argue against platform-specific switches. "grep translateChar" has
found more than 500 hits on R-devel, and I suspect that checking each of
them will take some time, one way or another.
>
> You offered to spend time on this.  I'd appreciate some checks of the
> patch I'm developing for 16098, and also some research into how
> certain things (e.g. the iswprint function) are handled on Windows.
I have looked in SVN, but couldn't find commits authored by you related
to this bug in any of the branches. Could you please point me at your
patch, or send it to me? Thanks!


-Kirill



Re: source(), parse(), and foreign UTF-8 characters

Duncan Murdoch-2
On 09/05/2017 5:46 PM, Kirill Müller wrote:

> On 09.05.2017 13:19, Duncan Murdoch wrote:
>> Those are good long term goals, though I think the effort is easier
>> than you think.  Rather than attempting to do it all at once, you
>> should look for ways to do it gradually and submit self-contained
>> patches.  In many cases it doesn't matter if strings are left in the
>> local encoding, because the encoding doesn't matter.  The problems
>> arise when UTF-8 strings are converted to the local encoding before
>> it's necessary, because that's a lossy conversion.  So a simple way to
>> proceed is to identify where these conversions occur, and remove them
>> one-by-one.
> Thanks, Duncan, this looks like a good start indeed. Did you really mean
> to say "the effort is easier than I think"? It would be great if I had
> overestimated the effort, I seldom do. That said, I'd be grateful if you
> could review/integrate/... future patches of mine towards parsing and
> sourcing of UTF-8 files with foreign characters, this problem seems to
> be self-contained (but perhaps not that easy).

I'll definitely look at small ones.  I'm not sure I'll have enough time
to do really big ones, so it's best to try to break things up into small
bites.

> I have looked in SVN, but couldn't find commits authored by you related
> to this bug in any of the branches. Could you please point me at your
> patch, or send it to me? Thanks!

Not committed yet; I'll send you something tomorrow.  Generally we try
to leave R in a good state after every commit, so I don't tend to commit
patches until I'm happy with them.  I started on this one a couple of
years ago, didn't like how it was going, and went on to easier things;
I've come back to it very recently.

Duncan Murdoch
