Re: read.csv trap

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: read.csv trap

bbolker

  Bump.

  It's been a week since I posted this to r-devel.  Any
thoughts/discussion?  Would R-core be irritated if I submitted a bug report?

  cheers
    Ben


-------- Original Message --------
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker <[hidden email]>
To: [hidden email] <[hidden email]>,  David Earn
<[hidden email]>

  This is not specifically a bug, but an (implicitly/obscurely)
documented behavior of read.csv (or read.table with fill=TRUE) that can
be quite dangerous/confusing for users.  I would love to hear some
discussion from other users and/or R-core about this ...  As always, I
apologize if I have missed some obvious workaround or reason that this
is actually the desired behavior ...

  In a nutshell, when fill=TRUE R guesses the number of columns from the
first 5 rows of the data set.  That's fine, and ?read.table documents this:

   The number of data columns is determined by looking at the first
     five lines of input (or the whole file if it has less than five
     lines), or from the length of ‘col.names’ if it is specified and
     is longer.  This could conceivably be wrong if ‘fill’ or
     ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).  I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into empty lines in the
data frame.

  Here is an example and a workaround that runs count.fields on the
whole file to find the maximum column length and set col.names
accordingly.  (It assumes you don't already have a file named "test.csv"
in your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test
the waters and see what people thought first, and I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be lots of different approaches to trying to
mitigate this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...

  A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.

  cheers
    Ben Bolker

===============
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con=file("test.csv"))


read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
  ncolnames <- length(colnames)
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols>ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled "V[large number]"
  read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: read.csv trap

Ken.Williams-2


On 2/11/11 1:39 PM, "Ben Bolker" <[hidden email]> wrote:

>[snip]
>-------- Original Message --------
>Subject: read.csv trap
>Date: Fri, 04 Feb 2011 11:16:36 -0500
>From: Ben Bolker <[hidden email]>
>To: [hidden email] <[hidden email]>,  David Earn
><[hidden email]>
>
>[snip]
>What is dangerous/confusing is that R silently **wraps** longer lines if
>fill=TRUE (which is the default for read.csv).
>[snip]


Based on your description, I would be very irritated if I encountered the
behavior you describe.  I would consider it a bug, though my opinion
doesn't necessarily count for much.

--
Ken Williams
Senior Research Scientist
Thomson Reuters
Phone: 651-848-7712
[hidden email]
http://labs.thomsonreuters.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: read.csv trap

Laurent Gatto
In reply to this post by bbolker
On 11 February 2011 19:39, Ben Bolker <[hidden email]> wrote:
>
[snip]
>
> What is dangerous/confusing is that R silently **wraps** longer lines if
> fill=TRUE (which is the default for read.csv).  I encountered this when
> working with a colleague on a long, messy CSV file that had some phantom
> extra fields in some rows, which then turned into empty lines in the
> data frame.
>

As a matter of fact, this is exactly what happened to a colleague of
mine yesterday and caused her quite a bit of trouble. On the other
hand, it could also be considered as a 'bug' in the csv file. Although
no formal specification exist for the csv format, RFC 4180 [1]
indicates that 'each line should contain the same number of fields
throughout the file'.

[1] http://tools.ietf.org/html/rfc4180

Best wishes,

Laurent

>  Here is an example and a workaround that runs count.fields on the
> whole file to find the maximum column length and set col.names
> accordingly.  (It assumes you don't already have a file named "test.csv"
> in your working directory ...)
>
>  I haven't dug in to try to write a patch for this -- I wanted to test
> the waters and see what people thought first, and I realize that
> read.table() is a very complicated piece of code that embodies a lot of
> tradeoffs, so there could be lots of different approaches to trying to
> mitigate this problem. I appreciate very much how hard it is to write a
> robust and general function to read data files, but I also think it's
> really important to minimize the number of traps in read.table(), which
> will often be the first part of R that new users encounter ...
>
>  A quick fix for this might be to allow the number of lines analyzed
> for length to be settable by the user, or to allow a settable 'maxcols'
> parameter, although those would only help in the case where the user
> already knows there is a problem.
>
>  cheers
>    Ben Bolker
>
> ===============
> writeLines(c("A,B,C,D",
>             "1,a,b,c",
>             "2,f,g,c",
>             "3,a,i,j",
>             "4,a,b,c",
>             "5,d,e,f",
>             "6,g,h,i,j,k,l,m,n"),
>           con=file("test.csv"))
>
>
> read.csv("test.csv")
> try(read.csv("test.csv",fill=FALSE))
>
> ## assumes header=TRUE, fill=TRUE; should be a little more careful
> ##  with comment, quote arguments (possibly explicit)
> ## ... contains information about quote, comment.char, sep
> Read.csv <- function(fn,sep=",",...) {
>  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>  ncolnames <- length(colnames)
>  maxcols <- max(count.fields(fn,sep=sep,...))
>  if (maxcols>ncolnames) {
>    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>  }
>  ## assumes you don't have any other columns labeled "V[large number]"
>  read.csv(fn,...,col.names=colnames)
> }
>
> Read.csv("test.csv")
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
[ Laurent Gatto | slashhome.be ]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: read.csv trap

bbolker
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/11/2011 03:37 PM, Laurent Gatto wrote:

> On 11 February 2011 19:39, Ben Bolker <[hidden email]> wrote:
>>
> [snip]
>>
>> What is dangerous/confusing is that R silently **wraps** longer lines if
>> fill=TRUE (which is the default for read.csv).  I encountered this when
>> working with a colleague on a long, messy CSV file that had some phantom
>> extra fields in some rows, which then turned into empty lines in the
>> data frame.
>>
>
> As a matter of fact, this is exactly what happened to a colleague of
> mine yesterday and caused her quite a bit of trouble. On the other
> hand, it could also be considered as a 'bug' in the csv file. Although
> no formal specification exist for the csv format, RFC 4180 [1]
> indicates that 'each line should contain the same number of fields
> throughout the file'.
>
> [1] http://tools.ietf.org/html/rfc4180
>
> Best wishes,
>
> Laurent

  Asserting that the bug is in the CSV file is logically consistent, but
if this is true then the "fill=TRUE" argument (which is only needed when
the lines contain different numbers of fields) should not be allowed.

 I had never seen RFC4180 before -- interesting!  I note especially
points 5-7 which define the handling of double quotation marks (but says
nothing about single quotes or using backslashes as escape characters).

  Dealing with read.[table|csv] seems a bit of an Augean task
<http://en.wikipedia.org/wiki/Augeas> (hmmm, maybe I should write a
parallel document to Burns's _Inferno_ ...)

  cheers
    Ben

>
>>  Here is an example and a workaround that runs count.fields on the
>> whole file to find the maximum column length and set col.names
>> accordingly.  (It assumes you don't already have a file named "test.csv"
>> in your working directory ...)
>>
>>  I haven't dug in to try to write a patch for this -- I wanted to test
>> the waters and see what people thought first, and I realize that
>> read.table() is a very complicated piece of code that embodies a lot of
>> tradeoffs, so there could be lots of different approaches to trying to
>> mitigate this problem. I appreciate very much how hard it is to write a
>> robust and general function to read data files, but I also think it's
>> really important to minimize the number of traps in read.table(), which
>> will often be the first part of R that new users encounter ...
>>
>>  A quick fix for this might be to allow the number of lines analyzed
>> for length to be settable by the user, or to allow a settable 'maxcols'
>> parameter, although those would only help in the case where the user
>> already knows there is a problem.
>>
>>  cheers
>>    Ben Bolker
>>
>> ===============
>> writeLines(c("A,B,C,D",
>>             "1,a,b,c",
>>             "2,f,g,c",
>>             "3,a,i,j",
>>             "4,a,b,c",
>>             "5,d,e,f",
>>             "6,g,h,i,j,k,l,m,n"),
>>           con=file("test.csv"))
>>
>>
>> read.csv("test.csv")
>> try(read.csv("test.csv",fill=FALSE))
>>
>> ## assumes header=TRUE, fill=TRUE; should be a little more careful
>> ##  with comment, quote arguments (possibly explicit)
>> ## ... contains information about quote, comment.char, sep
>> Read.csv <- function(fn,sep=",",...) {
>>  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>>  ncolnames <- length(colnames)
>>  maxcols <- max(count.fields(fn,sep=sep,...))
>>  if (maxcols>ncolnames) {
>>    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>>  }
>>  ## assumes you don't have any other columns labeled "V[large number]"
>>  read.csv(fn,...,col.names=colnames)
>> }
>>
>> Read.csv("test.csv")
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1VsX4ACgkQc5UpGjwzenPwsgCfTtGo0kJSXhUTPcY+p7cgaiuq
zHAAnikRORUhqLP9O+6M5SwyZcFEW9uT
=Rb2R
-----END PGP SIGNATURE-----

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: read.csv trap

bbolker
Ben Bolker <bbolker <at> gmail.com> writes:

> On 02/11/2011 03:37 PM, Laurent Gatto wrote:
> > On 11 February 2011 19:39, Ben Bolker <bbolker <at> gmail.com> wrote:
> >>
> > [snip]
> >>


  Bump.  Is there any opinion about this from R-core??
Will I be scolded if I submit this as a bug ... ??


> >> What is dangerous/confusing is that R silently **wraps** longer lines if
> >> fill=TRUE (which is the default for read.csv).  I encountered this when
> >> working with a colleague on a long, messy CSV file that had some phantom
> >> extra fields in some rows, which then turned into empty lines in the
> >> data frame.
> >>

  [snip snip]

> >>  Here is an example and a workaround that runs count.fields on the
> >> whole file to find the maximum column length and set col.names
> >> accordingly.  (It assumes you don't already have a file named "test.csv"
> >> in your working directory ...)
> >>
> >>  I haven't dug in to try to write a patch for this -- I wanted to test
> >> the waters and see what people thought first, and I realize that
> >> read.table() is a very complicated piece of code that embodies a lot of
> >> tradeoffs, so there could be lots of different approaches to trying to
> >> mitigate this problem. I appreciate very much how hard it is to write a
> >> robust and general function to read data files, but I also think it's
> >> really important to minimize the number of traps in read.table(), which
> >> will often be the first part of R that new users encounter ...
> >>
> >>  A quick fix for this might be to allow the number of lines analyzed
> >> for length to be settable by the user, or to allow a settable 'maxcols'
> >> parameter, although those would only help in the case where the user
> >> already knows there is a problem.
> >>
> >>  cheers
> >>    Ben Bolker
> >>
===============
writeLines(c("A,B,C,D",
            "1,a,b,c",
            "2,f,g,c",
            "3,a,i,j",
            "4,a,b,c",
            "5,d,e,f",
            "6,g,h,i,j,k,l,m,n"),
          con=file("test.csv"))
> >>
> >>
read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))
> >>
## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
 colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
 ncolnames <- length(colnames)
 maxcols <- max(count.fields(fn,sep=sep,...))
 if (maxcols>ncolnames) {
   colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
 }
 ## assumes you don't have any other columns labeled "V[large number]"
 read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel