read.delim()

read.delim()

Doran, Harold
I am reading in a very large file with names in it and R is truncating the number of rows it reads in. The separator in this file is a pipe '|' and so I use

dat <- read.delim('pathToMyFile', header= TRUE, sep='|')

It turns out that it is reading up to row 61145 and stopping and I think I see why, but am not sure of the best solution to this problem. I see the name of the person in the next row has a quote in it, such as:

Joe Sm"ith

I *think* this is causing a problem in the read in. In fact, whenever I use


tail(dat)

or dat[61145,]

R crashes.

But it doesn't crash when I use head(dat) or index any other row. I could change my raw data and manually delete this ". However, is there an argument to read.delim that would solve this, so that I would not have to manually change my raw data?

Harold

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: read.delim()

Phil Spector
Harold -
    If there aren't any true quoted fields in the file, you
could  pass the quote="" option to read.delim().
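
A minimal sketch of this suggestion, using a made-up three-line sample in place of the real file (the column names id and name and the sample rows are assumptions, not taken from the original data). With quote="", read.delim() should treat the " as an ordinary character rather than the start of a quoted field:

```r
# Hypothetical sample standing in for the real pipe-delimited file.
txt <- c("id|name",
         "1|Joe Sm\"ith",
         "2|Ann Lee")

# quote = "" switches off quote processing, so the stray " is kept as data.
good <- read.delim(text = txt, header = TRUE, sep = "|", quote = "")

nrow(good)                  # both data rows are read
as.character(good$name)[1]  # the name keeps its embedded quote
```

As noted above, this only suits files with no genuinely quoted fields, because any real quoting is then read literally.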

  - Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  [hidden email]

On Wed, 28 Jul 2010, Doran, Harold wrote:

> I am reading in a very large file with names in it and R is truncating the number of rows it reads in. The separator in this file is a pipe '|' and so I use
>
> dat <- read.delim('pathToMyFile', header= TRUE, sep='|')
>
> It turns out that it is reading up to row 61145 and stopping and I think I see why, but am not sure of the best solution to this problem. I see the name of the person in the next row has a quote in it, such as:
>
> Joe Sm"ith
>
> I *think* this is causing a problem in the read in. In fact, whenever I use
>
> tail(dat)
>
> or dat[61145,]
>
> R crashes.
>
> But, it doesn't crash when I use head(dat) or index any other row. I could change my raw data and manually delete this ". However, is there another solution within the args of read.delim that would be useful as a solution such that I would not have to manually change my raw data
>
> Harold
>

Re: read.delim()

Doran, Harold
Thank you, Phil. Unfortunately, there are quotes used properly elsewhere.

----- Original Message -----
From: Phil Spector <[hidden email]>
To: Doran, Harold
Cc: [hidden email] <[hidden email]>
Sent: Wed Jul 28 18:29:32 2010
Subject: Re: [R] read.delim()

Harold -
    If there aren't any true quoted fields in the file, you
could  pass the quote="" option to read.delim().

  - Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  [hidden email]


Re: read.delim()

bbolker
Doran, Harold <HDoran <at> air.org> writes:

>
> Thank you, Phil. Unfortunately, there are quotes used properly elsewhere.

  Does R actually 'crash' (i.e., stop/segmentation fault/etc.)?
Or does it just give you an error message?

  Assuming that the bad cases are always represented by a *single*
quotation mark on a line, you could find them by reading in the
whole file with r <- readLines(...) [assuming the file is small enough
to read into memory whole] and doing something like

 sapply(strsplit(r,""),function(x) sum(x=="\""))

to count the quotation marks on each line; the bad lines are the ones
with a count of 1.  There are certainly many more pathological
cases (what if there are (good) paired quotes and (bad) unpaired
quotes on the same line?  What if there are two (bad) unpaired quotes
on the same line?).
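
A small sketch of that check on made-up data (the sample lines are hypothetical; in practice r would come from readLines() on the real file). It flags lines with an odd number of quotes, which also catches a line holding one bad quote alongside a good pair, but not a line with two bad quotes:

```r
# Hypothetical sample lines standing in for r <- readLines("pathToMyFile").
r <- c("id|name|note",
       "1|Joe Sm\"ith|none",
       "2|Ann Lee|\"a quoted note\"",
       "3|Bob Ray|none")

# Count the double quotes on each line, as in the sapply() call above.
n_quotes <- sapply(strsplit(r, ""), function(x) sum(x == "\""))

# An odd count means at least one quote on that line is unpaired.
bad_lines <- which(n_quotes %% 2 == 1)
bad_lines
```

The flagged line numbers index into r, so they count the header line as line 1.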

  Sounds like it's time to do some manual editing.
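
If the file is too large to edit by hand, the same single-quote assumption allows a scripted repair instead. This is only a sketch on hypothetical sample lines: it deletes the quote from any line that contains exactly one, then parses the repaired lines with read.delim(). It changes the data (Joe Sm"ith becomes Joe Smith) and does nothing for the more pathological cases mentioned above.

```r
# Hypothetical sample lines standing in for r <- readLines("pathToMyFile").
r <- c("id|name|note",
       "1|Joe Sm\"ith|none",
       "2|Ann Lee|\"a quoted note\"",
       "3|Bob Ray|none")

# Number of double quotes per line.
n_quotes <- lengths(regmatches(r, gregexpr("\"", r, fixed = TRUE)))

# Assumption: a line with exactly one quote holds a stray quote; drop it.
single <- n_quotes == 1
r[single] <- gsub("\"", "", r[single], fixed = TRUE)

# Properly paired quotes elsewhere are still honoured by the default quote setting.
dat <- read.delim(text = r, header = TRUE, sep = "|")
nrow(dat)
```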
