Potential R bug in identical

Potential R bug in identical

Layik Hama
Hi,


This is my first email to r-help, and as I am not sure about the issue, I wanted to ask for help first.

The comments on this pull request <https://github.com/ropensci/stats19/pull/83> describe a particular string from a dataset that R on Windows seems to read differently from R on Linux and macOS, and from bash on Ubuntu Bionic. On Windows there seem to be some weird and unidentifiable (to me) characters in front of the `Accident_Index` column name, which make its length 17 rather than 14 characters.


I have inspected the string as best I could and cannot see why a Windows machine produces this output.


Is it an issue in `read.table()`?


Thanks


---

Layik Hama
Research Fellow

Leeds Institute for Data Analytics
Room 11.70, Worsley Building,
University of Leeds


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Potential R bug in identical

Ivan Krylov
On Thu, 17 Jan 2019 14:55:18 +0000
Layik Hama <[hidden email]> wrote:

> There seems to be some weird and unidentifiable (to me) characters in
> front of the `Accident_Index` column name there causing the length
> to be 17 rather than 14 characters.

Repeating the reproduction steps described at the linked pull request,

$ curl -o acc2017.zip \
    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
$ unzip acc2017.zip
$ head -n 1 Acc.csv | hd | head -n 2
00000000  ef bb bf 41 63 63 69 64  65 6e 74 5f 49 6e 64 65  |...Accident_Inde|
00000010  78 2c 4c 6f 63 61 74 69  6f 6e 5f 45 61 73 74 69  |x,Location_Easti|

The document begins with a U+FEFF BYTE ORDER MARK, encoded in UTF-8.
Not sure which encoding R chooses on Windows by default, but
explicitly passing encoding="UTF-8" (or is it fileEncoding?) might
help decode it as such. (Sorry, cannot test my advice on Windows right
now.)
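
The same check can be made from R without depending on the locale, by reading the first three bytes of the file as raw values. This is only a minimal sketch: `has_utf8_bom` is a hypothetical helper name (not part of R), and the temporary file with its second column is made up to mimic the start of Acc.csv.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 byte order mark.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Made-up test file: BOM bytes followed by a header line and one data row.
bom <- as.raw(c(0xef, 0xbb, 0xbf))
with_bom <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("Accident_Index,value\n2017A,1\n")), with_bom)

# The same content without the BOM, for comparison.
without_bom <- tempfile(fileext = ".csv")
writeBin(charToRaw("Accident_Index,value\n2017A,1\n"), without_bom)

has_utf8_bom(with_bom)
has_utf8_bom(without_bom)
```

Because `readBin()` returns the bytes untouched, the result does not depend on the session encoding.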

--
Best regards,
Ivan

Re: Potential R bug in identical

Layik Hama
Ivan,


Thank you for digging into the string. I can confirm that the `hexdump` shows extra characters on bash, too.


The question would then be:


Why would `identical(str, "Accident_Index", ignore.case = TRUE)` behave differently on Linux/MacOS vs Windows?


Thanks


---

Layik Hama
Research Fellow

Leeds Institute for Data Analytics
Room 11.70, Worsley Building,
University of Leeds
________________________________
From: Ivan Krylov <[hidden email]>
Sent: 17 January 2019 20:40:32
To: Layik Hama
Cc: [hidden email]
Subject: Re: [R] Potential R bug in identical


Re: Potential R bug in identical

Ivan Krylov
On Thu, 17 Jan 2019 21:05:07 +0000
Layik Hama <[hidden email]> wrote:

> Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
> behave differently on Linux/MacOS vs Windows?

Because str is different from "Accident_Index" on Windows: it was
decoded from bytes to characters according to different rules when the
file was read.
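
The effect can be reproduced on any platform by decoding the same bytes in two ways. This is a sketch, assuming the iconv implementation behind R recognises the name "CP1251" (the code page of a Russian Windows locale); the variable names are made up for illustration.

```r
# A UTF-8 BOM followed by the ASCII column name: 3 + 14 = 17 bytes.
bytes <- c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("Accident_Index"))
raw_str <- rawToChar(bytes)

# Interpreted as UTF-8, the three BOM bytes form the single character U+FEFF.
as_utf8 <- raw_str
Encoding(as_utf8) <- "UTF-8"

# Interpreted as Windows-1251, each BOM byte is a separate visible character.
as_cp1251 <- iconv(raw_str, from = "CP1251", to = "UTF-8")

nchar(as_utf8)                        # 15
nchar(as_cp1251)                      # 17
identical(as_utf8, "Accident_Index")  # FALSE: the BOM is still part of the string
```

Neither decoding gives a string identical to "Accident_Index"; the two differ only in how the leading bytes show up.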

The default encoding for files being read is specified by the
'encoding' option. On both Windows and Linux I get:

> options('encoding')
$encoding
[1] "native.enc"

For which ?file says (in section "Encoding"):

>> ‘""’ and ‘"native.enc"’ both mean the ‘native’ encoding, that is the
>> internal encoding of the current locale and hence no translation is
>> done.
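
What "native" resolves to in a particular session can be inspected directly. A small sketch using base R only (`info` is just an illustrative variable name):

```r
getOption("encoding")        # "native.enc" unless it has been changed
Sys.getlocale("LC_CTYPE")    # the locale that determines the native encoding
info <- l10n_info()
info[["UTF-8"]]              # TRUE when the session runs in a UTF-8 locale
```

If the last value is FALSE, a UTF-8 file read without an explicit encoding is interpreted in whatever encoding the locale implies.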

R on Linux runs in a UTF-8 locale here (AFAIK, the same holds on macOS)
and therefore decodes files as UTF-8 by default:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=ru_RU.utf8       LC_NUMERIC=C            
 [3] LC_TIME=ru_RU.utf8        LC_COLLATE=ru_RU.utf8    
 [5] LC_MONETARY=ru_RU.utf8    LC_MESSAGES=ru_RU.utf8  
 [7] LC_PAPER=ru_RU.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C          
[11] LC_MEASUREMENT=ru_RU.utf8 LC_IDENTIFICATION=C    

On Windows, by contrast, R uses a locale-dependent code page, here the single-byte Windows-1251:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251  
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C                  
[5] LC_TIME=Russian_Russia.1251    

> readLines('test.txt')[1]
[1] "п»їAccident_Index"
> nchar(readLines('test.txt')[1])
[1] 17

R on Windows can be explicitly told to decode the file as UTF-8:

> nchar(readLines(file('test.txt',encoding='UTF-8'))[1])
[1] 15

The first character of the string is the invisible byte order mark.
Thankfully, there is an easy fix for that, too. ?file additionally
says:

>> As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted for
>> reading and will remove a Byte Order Mark if present (which it
>> often is for files and webpages generated by Microsoft applications).

So this is how we get the 14-character column name we'd wanted:

> nchar(readLines(file('test.txt',encoding='UTF-8-BOM'))[1])
[1] 14
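
For anyone without the original file at hand, the same behaviour can be checked in a self-contained way. This is a sketch under stated assumptions: the temporary file stands in for test.txt and its contents are made up; the connection is opened and closed explicitly so that no unused connection is left behind.

```r
# Made-up stand-in for test.txt: a UTF-8 BOM followed by the column name.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("Accident_Index\n")), tmp)

# Open the file with the BOM-aware encoding, read one line, then close it.
con <- file(tmp, open = "r", encoding = "UTF-8-BOM")
first_line <- readLines(con, n = 1L)
close(con)

nchar(first_line)   # 14
```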

For our original task, this means:

> names(read.csv('Acc.csv'))[1] # might produce incorrect results
[1] "п.їAccident_Index"
> names(read.csv('Acc.csv', fileEncoding='UTF-8-BOM'))[1] # correct
[1] "Accident_Index"
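
The `read.csv()` variant can be tried the same way on a small made-up file. Again only a sketch: the file contents, the second column and the name `acc` are illustrative assumptions, not taken from Acc.csv.

```r
# Made-up CSV that starts with a UTF-8 BOM, like the real Acc.csv does.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("Accident_Index,value\n2017A,1\n")),
  tmp
)

# fileEncoding = "UTF-8-BOM" decodes the file as UTF-8 and drops the BOM.
acc <- read.csv(tmp, fileEncoding = "UTF-8-BOM")

names(acc)[1]
identical(names(acc)[1], "Accident_Index")   # TRUE
```

Passing `fileEncoding` explicitly makes the result independent of the locale R happens to run in.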

--
Best regards,
Ivan
