Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

David Byrne
Bug
Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
file containing the infinity symbol (' ∞ ') results in the infinity
symbol being imported as the number 8. Other Unicode characters seem
unaffected, for example Zhe: ж

Expected Behavior:
The imported data.frame should represent the infinity symbol as the
expected 'Inf' so that normal mathematical operations can be processed
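
As far as I know, base R's numeric parser recognises the text "Inf" but not the "∞" symbol itself, so this mapping would have to be done explicitly. A minimal sketch, assuming the column was read as character and the symbol reached R intact (i.e. it was not already replaced by "8"); the helper name parse_infinity is made up for illustration:

```r
# Hypothetical helper: turn strings such as U+221E or "-" followed by U+221E
# into Inf / -Inf, leaving ordinary numbers untouched.
parse_infinity <- function(x) {
  x <- enc2utf8(x)                             # work in UTF-8, not the native encoding
  x <- gsub("\u221E", "Inf", x, fixed = TRUE)  # U+221E is the infinity symbol
  as.numeric(x)                                # "Inf" and "-Inf" parse as numbers
}

vals <- parse_infinity(c("1.5", "\u221E", "-\u221E"))
print(vals)
```

Writing the symbol as the escape "\u221E" rather than pasting it keeps the pattern itself out of reach of any console-level substitution.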

Stack Overflow Post:
I created a question on Stack Overflow where one other member was able
to reproduce the same issues I was having. This question can be found
at:
https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int

Method to Reproduce - 1:
A simple way to reproduce this issue is to use RStudio: in the
console, type the following:
> read.table(text=" ∞", encoding="UTF-8")

The result is a data.frame with a single value of '8'.

Repeating the same with ж results in the correct, expected behavior.

Method to Reproduce - 2:
Create a .csv file containing the infinity and Zhe characters (I have
attached the file for convenience; hopefully it is not rejected by your
email service). Launch an interactive session using

> r --vanilla

Enter the following statement taking care to replace the
<path-to-file> with the appropriate one:

> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")


This should result in a two element data.frame; the first being the
incorrect value of 8 with an additional <U+FEFF> and the second the
correct value of Zhe.

Note the additional <U+FEFF> prefixed to the front of the '8'. This
appears to be a hidden character that lets editors know the encoding.
The following link has some explanation; however, it states that this
is caused by Excel, whereas I created the file using Notepad, not
Excel.

https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7

System Details:
OS:
> Windows 10.0.17134 Build 17134


R Version:

> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status
> major          3
> minor          4.1
> year           2017
> month          06
> day            30
> svn rev        72865
> language       R
> version.string R version 3.4.1 (2017-06-30)
> nickname       Single Candle
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Peter Dalgaard-2
This doesn't seem to be happening on macOS, either in Terminal or in RStudio (R 3.5.1, R-devel, R-patched). So it is probably Windows-specific.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

David Byrne
I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is
most likely correct; it looks like it's Windows-specific.

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Daniel Possenriede
There seems to be something odd with "∞" on Windows (and not only with
read.table).
In native encoding (cp-1252 in my case), "∞" gets converted to "8":

x <-  "∞"
Encoding(x)
#> [1] "unknown"
print(x)
#> [1] "8"
charToRaw(x)
#> [1] 38

"∞" is indeed "8"

identical(x, "8")
#> [1] TRUE

Everything seems fine if  "∞" is UTF-8 encoded.

y <- "\u221E"
Encoding(y)
#> [1] "UTF-8"
print(y)
#> [1]  "∞"
charToRaw(y)
#> [1] e2 88 9e

Unless the string is converted back to native encoding.

format(y)
#> [1] "8"

This ought to be "<U+221E>", analogous to

format("∝")
#> [1] "<U+221D>"

Session Info:

si <- sessionInfo()
si$running
#> [1] "Windows 10 x64 (build 17134)"
si$R.version$version.string
#> [1] "R version 3.5.2 (2018-12-20)"
si$locale
#> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"



Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Paul McQuesten
Windows Notepad prefixes UTF-8 files with a Byte Order Mark (U+FEFF).
Per https://en.wikipedia.org/wiki/Byte_order_mark, this is permitted in
UTF-8, but not required.
I suppose that there are other Windows programs which do likewise (in
addition to Excel and Notepad).

"The Unicode Standard permits the BOM in UTF-8, but does not require or
recommend its use. Byte order has no meaning in UTF-8, so its only use in
UTF-8 is to signal at the start that the text stream is encoded in UTF-8,
or that it was converted to UTF-8 from a stream that contained an optional
BOM. The standard also does not recommend removing a BOM when it is there,
so that round-tripping between encodings does not lose information, and so
that code that relies on it continues to work. The IETF recommends that if
a protocol either (a) always uses UTF-8, or (b) has some other way to
indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF
as a signature.""
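
For the stray <U+FEFF> in the second reproduction, R's connections accept the encoding name "UTF-8-BOM", which ?file documents as dropping a leading byte order mark when reading. A small sketch using a temporary file; the file contents are made up for illustration and kept ASCII-only so the result does not depend on the locale:

```r
# Write a small UTF-8 file that starts with a BOM (bytes EF BB BF),
# similar to what Notepad produces. The contents are illustrative only.
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("a,b\n1,2\n")), path)

# fileEncoding = "UTF-8-BOM" asks the connection to skip the BOM, if present.
with_bom_handling <- read.table(path, sep = ",", header = TRUE,
                                fileEncoding = "UTF-8-BOM")
print(names(with_bom_handling))

unlink(path)
```

This addresses only the BOM, not the substitution of "∞" by "8", which is a separate issue.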

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Tomas Kalibera
In reply to this post by Daniel Possenriede
I can reproduce this behavior on my Windows 10 system in RGui (cp1252):
when I paste the Unicode infinity symbol into the console, it is treated
as the number 8. This is caused by the Windows "best fit" default behavior
in conversion of Unicode characters to characters in the current native
encoding: at some point in the past, 8 has been chosen as a good fit for
infinity in Windows. In my scenario, the conversion is invoked by RGui
before returning the input to the main R loop, even before the input
gets to the parser. In principle, we could change this particular
conversion in RGui to avoid the substitution. RGui uses "\uxxxx" escapes
to pass characters that cannot be represented, which is why e.g. the
Cyrillic Zhe \u0436 worked. We could therefore tell Windows not to do the
substitution and pass "\u221e" for infinity; the string, after being
processed by the parser, would then be represented in UTF-8 inside R and
could be e.g. printed by the RGui console. That is something that could
be considered, but it will not solve the main problem and it may
actually cause trouble to users who are used to such substitutions
(especially when the substitutions are more intuitive, but that may be
a matter of opinion).

The main problem is that in normal use, sooner or later R will get to
the point when it will need to do the conversion to native encoding, and
in some context where "\uxxxx" escapes will not be possible. One cannot
reliably work with strings in R that cannot be represented in the
current native encoding (except when one knows precisely how to avoid
the conversion in some specific task, but that may be brittle; so the
best-fit substitution might in principle help here). This problem does
not exist on Unix/macOS systems where the current native encoding is
UTF-8 these days, so today it only exists on Windows where UTF-8 cannot
be the current native encoding. As has been discussed before, even
though we could rewrite in principle all calls to Windows API to use
Unicode and have all strings in UTF-8 in R, we would still have problems
when interfacing with packages that assume strings are in current native
encoding (without checking), so this problem won't be easy to fix.
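
One way to see which strings would not survive such a conversion is to ask iconv() directly. The sketch below uses "ASCII" as a stand-in target so that the result does not depend on the platform; in a real Windows session the target of interest would be the native encoding, where the best-fit substitution described above may additionally come into play. The helper name is hypothetical.

```r
# Hypothetical helper: TRUE for elements that cannot be converted from
# UTF-8 to the target encoding (iconv() returns NA for those).
not_representable <- function(x, to = "ASCII") {
  is.na(iconv(enc2utf8(x), from = "UTF-8", to = to))
}

x <- c("8", "\u221E", "\u0436")   # digit eight, infinity symbol, Cyrillic Zhe
flags <- not_representable(x)
print(flags)

# With sub = "byte", non-convertible bytes are shown as <xx> hex escapes
# instead of the whole element becoming NA.
escaped <- iconv("\u221E", from = "UTF-8", to = "ASCII", sub = "byte")
print(escaped)
```

A check like this only reports what iconv() can convert; it does not prevent conversions that happen elsewhere, for instance in a GUI front end before the input reaches the parser.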

Best,
Tomas

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Peter Dalgaard-2
Fortune nomination...

> On 8 Feb 2019, at 13:07 , Tomas Kalibera <[hidden email]> wrote:
>
> This is caused by Windows "best fit" default behavior in conversion of unicode characters to characters in the current native encoding: at some point in the past, 8 has been chosen as a good fit for infinity in Windows.

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Daniel Possenriede
In reply to this post by Tomas Kalibera
Tomas,

> In my scenario, the conversion is invoked by RGui before returning the
> input to the main R loop, even before the input gets to the parser. In
> principle, we could change this particular conversion in RGui to avoid the
> substitution.

Not sure whether I am missing something here, but I used RStudio for my
examples (I should have said) and David mentioned RStudio as well, so it
does not seem to be a problem with RGui only.

Another example of the "best fit" behaviour seems to be "Σ"
("\u03A3", Greek capital letter sigma, not "\u2211", n-ary summation):

print("Σ")
#> [1] "S"

Again with cp1252 on Windows 10, R 3.5.2, RStudio 1.2.1256 preview.
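
To check which code point a character actually is (for example to tell the Greek sigma from the summation sign, or to see whether a substitution has already happened), utf8ToInt() can be used. A small sketch; code_point is a made-up helper name:

```r
# Hypothetical helper: format the code point of a single character as U+XXXX.
code_point <- function(ch) {
  sprintf("U+%04X", utf8ToInt(enc2utf8(ch)))
}

sigma     <- code_point("\u03A3")  # Greek capital letter sigma
summation <- code_point("\u2211")  # n-ary summation
digit     <- code_point("8")       # what the infinity symbol becomes after best-fit
print(c(sigma, summation, digit))
```

If a pasted "Σ" reports the code point of a plain "S", the substitution has already taken place before the string reached this function.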

> even though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.

Since I regularly encounter the reverse problem, i.e. packages that assume
strings are in UTF-8 encoding without checking (which isn't very
surprising, assuming that most package developers develop on Unix/macOS
systems), I'd say, "rip off the band-aid sooner rather than later". Obviously
I don't know how many bugs would surface in packages if R for Windows'
native encoding were to switch to UTF-8, but these bugs would only be
transitory, I suppose, whereas with the current situation there is a steady
inflow of assume-UTF-8-encoding bugs in new packages and functions.
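
For package code, one defensive pattern against both kinds of assumption is to normalise inputs explicitly with enc2utf8() (or enc2native()) rather than relying on the session's encoding. A minimal sketch, with a made-up function name:

```r
# Hypothetical package function that does not assume its input encoding:
# it converts to UTF-8 first, then works on the UTF-8 bytes.
utf8_bytes <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  charToRaw(enc2utf8(x))
}

# "e with acute accent" stored as a single latin1 byte and declared as such.
s <- rawToChar(as.raw(0xE9))
Encoding(s) <- "latin1"

bytes <- utf8_bytes(s)
print(bytes)
```

This relies on the input carrying a correct encoding mark; strings of "unknown" encoding are still interpreted as being in the native encoding.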

Best,
Daniel


> > read.table)
> > In native encoding (cp-1252 in my case), "∞" gets converted to "8"
> >
> > x <-  "∞"
> > Encoding(x)
> > #> [1] "unknown"
> > print(x)
> > #> [1] "8"
> > charToRaw(x)
> > #> [1] 38
> >
> > "∞" is indeed "8"
> >
> > identical(x, "8")
> > #> [1] TRUE
> >
> > Everything seems fine if  "∞" is UTF-8 encoded.
> >
> > y <- "\u221E"
> > Encoding(y)
> > #> [1] "UTF-8"
> > print(y)
> > #> [1]  "∞"
> > charToRaw(y)
> > #> [1] e2 88 9e
> >
> > Unless the string is converted back to native encoding.
> >
> > format(y)
> > #> [1] "8"
> >
> > This ought to be "<U+221E>", equivalently to
> >
> > format("∝")
> > #> [1] "<U+221D>"
> >
> > Session Info:
> >
> > si <- sessionInfo()
> > si$running
> > #> [1] "Windows 10 x64 (build 17134)"
> > si$R.version$version.string
> > #> [1] "R version 3.5.2 (2018-12-20)"
> > si$locale
> > #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> >
> >
> >
> > On Thu, 7 Feb 2019 at 14:33, David Byrne <[hidden email]> wrote:
> >
> >> I can confirm that it doesn't happen on Ubuntu 18.04.1 so Peter is
> >> most likely correct; it looks like its Windows specific.
> >>
> >> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <[hidden email]> wrote:
> >>> This doesn't seem to be happening on MacOS, neither in Terminal nor
> >>> RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.
> >>> -pd

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Tomas Kalibera
In reply to this post by David Byrne

I can reproduce with read.table(encoding="UTF-8") in RGui on Windows 10,
reading a file containing the two UTF-8 characters. The table is read
correctly into R as documented (both characters are represented in UTF-8
and marked as such), but the conversion of Infinity to 8 and of Zhe to
<U+0436> happens later, during printing with print.data.frame(). For
instance, it currently does not happen during print(as.matrix()). As I
wrote in more detail in another email in this thread, R sometimes needs
to convert strings to the current native encoding; Windows converts
Infinity to 8 by default as a "best fit" but fails to convert Zhe, so R
displays <U+0436>.
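
The difference between what is stored and what is printed can be checked
by looking at the encoding mark and the code points instead of the printed
data frame. A minimal sketch follows; the temporary file only stands in for
unicode_chars.csv, it is written without a BOM, and stringsAsFactors = FALSE
is set explicitly so that the columns are character vectors.

```r
# Write the UTF-8 bytes of "∞,ж\n" to a temporary file (no BOM).
path <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0xe2, 0x88, 0x9e, 0x2c, 0xd0, 0xb6, 0x0a)), path)

df <- read.table(path, sep = ",", encoding = "UTF-8",
                 stringsAsFactors = FALSE)

Encoding(df$V1)   # encoding mark of the stored string
utf8ToInt(df$V1)  # 8734, i.e. U+221E
utf8ToInt(df$V2)  # 1078, i.e. U+0436
charToRaw(df$V1)  # e2 88 9e
unlink(path)
```

If the code points come out as above, the data were imported intact and any
"8" seen on screen comes from a later conversion to native encoding.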

It is easiest to use only input files in the current native encoding, so
one could convert them before passing them to R (making sure that
conversion does not have similar problems), or use R on a non-Windows
platform. Relying on which R functions/packages can work with non-native
encodings may be brittle, but of course any R function that is documented
to work with non-native encodings (like read.table(encoding=)) should do
so. If it does not, it will be fixed following a bug report.

I am not sure if that is what you had in mind, but conversion of
character (string) to double is a different matter. as.double() now
returns NA for "∞" (on Linux), as documented in ?as.double.
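
If the aim is to end up with Inf, the mapping has to be done explicitly
after the column has been read as character. A sketch under that
assumption: parse_inf is a hypothetical helper, not part of R, and it can
only work while the strings still hold U+221E, that is, before any
conversion to native encoding has replaced it with 8.

```r
# Hypothetical helper: numeric conversion that maps the infinity symbol
# (U+221E), optionally with a leading minus sign, to Inf / -Inf.
parse_inf <- function(x) {
  x <- trimws(enc2utf8(x))
  out <- suppressWarnings(as.numeric(x))  # "\u221e" becomes NA here
  out[which(x == "\u221e")] <- Inf
  out[which(x == "-\u221e")] <- -Inf
  out
}

parse_inf(c("1.5", " \u221e", "-\u221e", "abc"))
# 1.5  Inf -Inf   NA
```

The symbol is written as an escape so that the source text itself does not
depend on the encoding of the script or console.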

Best
Tomas



Re: Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Duncan Murdoch-2
In reply to this post by Daniel Possenriede
On 08/02/2019 11:12 a.m., Daniel Possenriede wrote:

> Since I regularly encounter the reverse problem, i.e. packages that assume
> strings are in UTF-8 encoding without checking (which isn't very
> surprising, assuming that most package developers develop on Unix/macOS
> systems), I'd say, "rip off the band-aid sooner rather than later". Obviously
> I don't know how many bugs would surface in packages if R for Windows'
> native encoding were to switch to UTF-8, but I suppose these bugs would only
> be transitory, whereas with the current situation there is a steady inflow
> of assume-UTF-8-encoding bugs in new packages and functions.

Just one minor comment:  it is *impossible* for R for Windows "native"
encoding to switch to UTF-8, since Windows doesn't support that.  The
necessary change (which I'd support, but it's a really large amount of
work) would be for R to drop its use of native encodings internally.
Convert everything to UTF-8 on the way in, convert to native on the way out.
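
At the level of a single function, the "UTF-8 on the way in, native on the
way out" idea can be sketched with the existing enc2utf8() and
enc2native(). This only illustrates the principle for package-level code,
not the changes R itself would need; shout is a made-up example function.

```r
# Hypothetical function that normalises its input at the boundary.
shout <- function(x) {
  x <- enc2utf8(x)   # way in: whatever the encoding mark, work in UTF-8
  paste0(x, "!")
}

# A latin1-marked "é" (single byte e9), as it might arrive from other code.
x <- rawToChar(as.raw(0xe9))
Encoding(x) <- "latin1"

y <- shout(x)
Encoding(y)    # "UTF-8"
charToRaw(y)   # c3 a9 21

# Way out: convert only at the point where native encoding is required.
z <- enc2native(y)
```

What z looks like depends on the session's native encoding, which is
exactly the part that cannot represent every character.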

This is a large amount of work because R has preferred native encodings
basically forever, so there are tons of locations needing changes, and a
large effort would be required to make them.  It would likely be easier
for Windows to add UTF-8 as a native encoding.  Converting between that
and Windows internal UTF-16 is nearly trivial, much easier than many of
the conversions it does.  And Microsoft has revenues of $90 billion per
year, while R Core only has a few individuals donating their time:  so
wouldn't it make more sense to ask them to act like responsible members
of the computing community?

Duncan Murdoch
