Unexpected behaviour from read.table

Unexpected behaviour from read.table

Michael-5
I’ve been struggling with seemingly ‘corrupt’ data.frames for a few days, and I believe I’ve narrowed the problem down to some odd behaviour from read.table.

I receive a tab-delimited file from an external provider in which strings are encoded as ="content". I'm not sure why; perhaps because most users open it in Excel.
My specific issue is that trailing spaces inside any of these quoted strings cause strange results from read.table:

# No trailing spaces
read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01\"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
      V1    V2
1     ID Value
2 =Total  1000
3  =CJ01   550
4  =CF02   450

# Now with trailing spaces in line 3
read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
        V1    V2
1    =CF02   450
2       ID Value
3   =Total  1000
4 =CJ01      550
5    =CF02   450

I solved my specific problem by setting quote="" and extracting the string content after calling read.table. As my original code had header=TRUE, I was finding that seemingly random rows were being used as column names!
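For anyone hitting the same thing, the workaround described above might look roughly like the sketch below. It reuses the sample input from this post; the regular expression and the explicit stringsAsFactors = FALSE are my own assumptions about how to do the "extracting" step, not something prescribed by read.table.

```r
# Sample input from this post: the second data line has trailing spaces
# inside the quoted string.
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# quote = "" disables quote processing, so each field is read verbatim,
# including the leading =" and the closing ".
raw <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  stringsAsFactors = FALSE)

# Strip the =" ... " wrapper from the string column (assumed pattern).
raw$ID <- sub('^="(.*)"$', "\\1", raw$ID)

# Optionally drop the trailing spaces as well:
# raw$ID <- trimws(raw$ID)
```

Because quoting is switched off entirely, this only makes sense when the fields never contain a tab or a newline inside the quotes.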

Flagging a potential issue with read.table, although I can easily accept I'm missing something obvious here.

Best,
 Michael

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)  / x86_64-pc-linux-gnu (64-bit)
Running under: macOS High Sierra 10.13.2 /  Ubuntu 16.04.3 LTS


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Unexpected behaviour from read.table

Peter Dalgaard-2
This looks like a bug. Specifically, inside read.table, the call

    lines <- .External(C_readtablehead, file, nlines, comment.char,
        blank.lines.skip, quote, sep, skipNul)

returns "lines" as

[1] "ID\tValue"                         "=\"Total\"\t1000"                
[3] "=\"CJ01   \"\t550\n=\"CF02\"\t450"

Notice the embedded \n in the 3rd element, i.e., there are really 4 lines there. This gets pushed back twice and the first 3 (not 4) lines get read again as part of the header logic. Then, when it comes to reading the data proper, the 4th line has ended up duplicated as the top row.

As you suggest, it seems that something is up with the quote matching logic.
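Until this is sorted out, a cheap sanity check is to compare the number of physical lines in the input with the number of rows that come back. The sketch below is only an illustration: rows_match_lines is a hypothetical helper, not part of base R, and it assumes tab-separated input with no blank lines and no trailing newline.

```r
# Hypothetical helper: TRUE if read.table returns one row per physical line.
rows_match_lines <- function(txt, ...) {
  n_lines <- length(strsplit(txt, "\n", fixed = TRUE)[[1]])
  df <- read.table(text = txt, header = FALSE, sep = "\t", ...)
  nrow(df) == n_lines
}

txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# With quote processing disabled, each physical line becomes exactly one row.
rows_match_lines(txt, quote = "")

# With the default quote setting the result may differ between R versions,
# so no expected value is claimed here.
rows_match_lines(txt)
```

A mismatch does not say where the problem is, only that some line was merged, split, or duplicated on the way in.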

-pd


> On 4 Feb 2018, at 23:45 , Michael <[hidden email]> wrote:

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]
