String encoding problem

hadley wickham
If you print:

"\xc9\x82\xbf"

you get

 "\u0242\xbf"

But if you try and evaluate that string you get:

>  "\u0242\xbf"
Error: mixing Unicode and octal/hex escapes in a string is not allowed

(Probably will only happen on mac/linux with default utf-8 encoding)

Hadley

--
http://hadley.nz

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: String encoding problem

Duncan Murdoch-2
On 07/07/2016 10:57 AM, Hadley Wickham wrote:

> If you print:
>
> "\xc9\x82\xbf"
>
> you get
>
>  "\u0242\xbf"
>
> But if you try and evaluate that string you get:
>
>>  "\u0242\xbf"
> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>
> (Probably will only happen on mac/linux with default utf-8 encoding)

I'm not sure what should happen here, but that's not a legal string in a
UTF-8 locale, so it's not too surprising that things go wonky.
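
As a rough illustration of why these three bytes are not legal UTF-8 (a sketch only, assuming R 3.3.0 or later for validUTF8(); the variable names are made up for this example), the bytes can be built from a raw vector, which avoids typing any string escapes:

```r
# Build the three bytes directly, without relying on string escapes.
bytes <- as.raw(c(0xc9, 0x82, 0xbf))
s <- rawToChar(bytes)

# The whole sequence is not valid UTF-8 ...
whole_ok <- validUTF8(s)                  # FALSE

# ... but the first two bytes on their own are: they encode U+0242.
head2 <- rawToChar(bytes[1:2])
Encoding(head2) <- "UTF-8"
head_ok <- validUTF8(head2)               # TRUE
code_point <- utf8ToInt(head2)            # 578, i.e. 0x0242

# 0xbf has the bit pattern 10xxxxxx, so it is a continuation byte
# and cannot stand alone after a complete character.
is_continuation <- bitwAnd(0xbfL, 0xc0L) == 0x80L   # TRUE
```

This matches the printed form in the original post: the valid two-byte prefix is shown as \u0242 and the stray byte is shown as \xbf.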

Duncan Murdoch

Re: String encoding problem

hadley wickham
On Thu, Jul 7, 2016 at 10:11 AM, Duncan Murdoch
<[hidden email]> wrote:

> On 07/07/2016 10:57 AM, Hadley Wickham wrote:
>>
>> If you print:
>>
>> "\xc9\x82\xbf"
>>
>> you get
>>
>>  "\u0242\xbf"
>>
>> But if you try and evaluate that string you get:
>>
>>>  "\u0242\xbf"
>>
>> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>>
>> (Probably will only happen on mac/linux with default utf-8 encoding)
>
>
> I'm not sure what should happen here, but that's not a legal string in a
> UTF-8 locale, so it's not too surprising that things go wonky.

Here's a bit more context on how I got that sequence of bytes:

x <- "こんにちは"
y <- iconv(x, to = "Shift-JIS")
Encoding(y)
y

I did this to create an example to demonstrate how to handle encoding
problems, and it's a bit frustrating that I have to manually mangle the
string in order to be able to re-use it in another session.  Maybe
strings with unknown encoding shouldn't use unicode escapes?

Hadley

--
http://hadley.nz

Re: String encoding problem

Simon Urbanek

> On Jul 7, 2016, at 11:40 AM, Hadley Wickham <[hidden email]> wrote:
>
> On Thu, Jul 7, 2016 at 10:11 AM, Duncan Murdoch
> <[hidden email]> wrote:
>> On 07/07/2016 10:57 AM, Hadley Wickham wrote:
>>>
>>> If you print:
>>>
>>> "\xc9\x82\xbf"
>>>
>>> you get
>>>
>>> "\u0242\xbf"
>>>
>>> But if you try and evaluate that string you get:
>>>
>>>> "\u0242\xbf"
>>>
>>> Error: mixing Unicode and octal/hex escapes in a string is not allowed
>>>
>>> (Probably will only happen on mac/linux with default utf-8 encoding)
>>
>>
>> I'm not sure what should happen here, but that's not a legal string in a
>> UTF-8 locale, so it's not too surprising that things go wonky.
>
> Here's a bit more context on how I got that sequence of bytes:
>
> x <- "こんにちは"
> y <- iconv(x, to = "Shift-JIS")
> Encoding(y)
> y
>
> I did this to create an example to demonstrate how to handle encoding
> problems, and it's a bit frustrating that I have to manually mangle the
> string in order to be able to re-use it in another session.  Maybe
> strings with unknown encoding shouldn't use unicode escapes?
>

The real issue is that the only supported encodings of strings in R are native (= current locale), latin1, and UTF-8. So unless you're running in a Shift-JIS locale, that encoding is not supported in your R, and the result of the iconv() above is not a valid R string, just a sequence of bytes that R doesn't know how to deal with. It tries to interpret it in your locale (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing this in the C locale yields a different result:

> x
[1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
> y
[1] "\202\261\202\361\202\311\202\277\202\315"

If you want a result that does not depend on your locale and is none of the supported encodings, you have to declare it as bytes (back in UTF-8):

> Encoding(y)="bytes"
> y
[1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
> iconv(y, from="Shift-JIS", to="utf-8")
[1] "こんにちは"

But that has its own perils such as the fact that you cannot dput() byte-encoded strings.
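
One possible way around the dput() limitation, sketched here under the assumption that the platform's iconv recognises the name "Shift-JIS" (iconvlist() shows what is available), is to keep the converted text as a raw vector rather than as a character string. iconv() has a toRaw argument and also accepts a list of raw vectors as input, and a raw vector prints the same way in every locale. The variable names are illustrative only.

```r
# "こんにちは" built from code points, so the script itself stays ASCII.
x <- intToUtf8(c(0x3053, 0x3093, 0x306b, 0x3061, 0x306f))

# Ask iconv() for raw bytes instead of a character string.
sjis_bytes <- iconv(x, from = "UTF-8", to = "Shift-JIS", toRaw = TRUE)[[1]]

# A raw vector can be written out and pasted into another session.
dput(sjis_bytes)

# In the other session: rebuild the bytes and convert back to UTF-8.
# iconv() takes a list of raw vectors as its input here.
stored <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9, 0x82, 0xbf, 0x82, 0xcd))
back <- iconv(list(stored), from = "Shift-JIS", to = "UTF-8")
```

If a character string is still needed for a demonstration of encoding problems, rawToChar(stored) produces one, with the same caveats about printing described above.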

Cheers,
Simon

Re: String encoding problem

hadley wickham
>>> I'm not sure what should happen here, but that's not a legal string in a
>>> UTF-8 locale, so it's not too surprising that things go wonky.
>>
>> Here's a bit more context on how I got that sequence of bytes:
>>
>> x <- "こんにちは"
>> y <- iconv(x, to = "Shift-JIS")
>> Encoding(y)
>> y
>>
>> I did this to create an example to demonstrate how to handle encoding
>> problems, and it's a bit frustrating that I have to manually mangle the
>> string in order to be able to re-use it in another session.  Maybe
>> strings with unknown encoding shouldn't use unicode escapes?
>>
>
> The real issue is that the only supported encodings of strings in R are native (= current locale), latin1, and UTF-8. So unless you're running in a Shift-JIS locale, that encoding is not supported in your R, and the result of the iconv() above is not a valid R string, just a sequence of bytes that R doesn't know how to deal with. It tries to interpret it in your locale (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing this in the C locale yields a different result:
>
>> x
> [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
>> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
>> y
> [1] "\202\261\202\361\202\311\202\277\202\315"
>
> If you want a result that does not depend on your locale and is none of the supported encodings, you have to declare it as bytes (back in UTF-8):
>
>> Encoding(y)="bytes"
>> y
> [1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
>> iconv(y, from="Shift-JIS", to="utf-8")
> [1] "こんにちは"
>
> But that has its own perils such as the fact that you cannot dput() byte-encoded strings.

Right - I'm aware of that.  But to me, it doesn't seem correct to
print a string that is not a valid R string. Why is an unknown
encoding printed like UTF-8?

Hadley

--
http://hadley.nz

Re: String encoding problem

Peter Dalgaard-2

> On 07 Jul 2016, at 18:15 , Hadley Wickham <[hidden email]> wrote:
>
> Right - I'm aware of that.  But to me, it doesn't seem correct to
> print a string that is not a valid R string. Why is an unknown
> encoding printed like UTF-8?
>

It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse, but it seems to me that there are three alternatives:

- refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
- refuse to print it (print(x) gives "cannot print non-UTF-8 string")
- what happens now

and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007, but I suspect that there are demons down the line which is why it is not happening now. (Does it ring a bell with anyone?)

-pd


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

Re: String encoding problem

Duncan Murdoch-2
On 07/07/2016 12:51 PM, peter dalgaard wrote:

> > On 07 Jul 2016, at 18:15 , Hadley Wickham <[hidden email]> wrote:
> >
> > Right - I'm aware of that.  But to me, it doesn't seem correct to
> > print a string that is not a valid R string. Why is an unknown
> > encoding printed like UTF-8?
> >
>
> It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse, but it seems to me that there are three alternatives:
>
> - refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
> - refuse to print it (print(x) gives "cannot print non-UTF-8 string")
> - what happens now
>
> and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007, but I suspect that there are demons down the line which is why it is not happening now. (Does it ring a bell with anyone?)

A fifth option would be to use only hex escapes when invalid UTF-8 was
found.  That would echo back the input in this case.  No idea if it
would cause other problems.
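
A minimal user-level sketch of what that fifth option could look like; it is not a description of how R's printing code works. to_hex_escapes() and from_hex_escapes() are hypothetical helper names invented for this example, and they escape every byte (ASCII included), which is cruder than what print() would need.

```r
# Hypothetical helper: render every byte of a string as a \xNN escape.
to_hex_escapes <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  bytes <- charToRaw(x)
  if (length(bytes) == 0L) return("")
  paste0("\\x", as.character(bytes), collapse = "")
}

# Hypothetical inverse: turn the escaped text back into the original
# bytes without going through the R parser.
from_hex_escapes <- function(s) {
  if (!nzchar(s)) return("")
  hex <- strsplit(s, "\\x", fixed = TRUE)[[1]][-1]
  rawToChar(as.raw(strtoi(hex, base = 16L)))
}

s <- rawToChar(as.raw(c(0xc9, 0x82, 0xbf)))
escaped <- to_hex_escapes(s)        # the text \xc9\x82\xbf
restored <- from_hex_escapes(escaped)
```

Because only hex escapes appear in the output, the escaped text never mixes \u and \x forms, which is the combination the parser rejected in the original report.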

Duncan Murdoch
