[R] Row limit for read.table

[R] Row limit for read.table

Frank McCown
I have been trying to read in a large data set using read.table, but
I've only been able to grab the first 50,871 rows of the total 122,269 rows.

 > f <-
read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
header=TRUE, nrows=123000, comment.char="", sep="\t")
 > length(f$change_rate)
[1] 50871

From searching the email archives, I believe this is due to size limits
of a data frame.  So...

1) Why doesn't read.table give a proper warning when it doesn't place
every read item into a data frame?

2) Why isn't there a parameter to read.table that allows the user to
specify which columns s/he is interested in?  This functionality would
allow extraneous columns to be ignored which would improve memory usage.

I've already made a work-around by loading the table into mysql and
doing a select on the 2 columns I need.  I just wonder why the above 2
points aren't implemented.  Maybe they are and I'm totally missing it.

Thanks,
Frank


--
Frank McCown
Old Dominion University
http://www.cs.odu.edu/~fmccown/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Row limit for read.table

Peter Dalgaard
Frank McCown wrote:

> I have been trying to read in a large data set using read.table, but
> I've only been able to grab the first 50,871 rows of the total 122,269 rows.
>
>  > f <-
> read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
> header=TRUE, nrows=123000, comment.char="", sep="\t")
>  > length(f$change_rate)
> [1] 50871
>
>  From searching the email archives, I believe this is due to size limits
> of a data frame.  So...
>  
I think you believe wrongly...
> 1) Why doesn't read.table give a proper warning when it doesn't place
> every read item into a data frame?
>  
That isn't the problem, it is a somewhat obscure interaction between
quote= and sep= that is doing you in. Remove the sep="\t" and/or add
quote="" and your life should be easier.
> 2) Why isn't there a parameter to read.table that allows the user to
> specify which columns s/he is interested in?  This functionality would
> allow extraneous columns to be ignored which would improve memory usage.
>
>  
There is! Check out colClasses:

> cc <- rep("NULL",5)
> cc[4:5] <- NA
> f <-
read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
header=TRUE, sep="\t", quote="", colClasses=cc)
> str(f)
'data.frame':   122271 obs. of  2 variables:
 $ recovered  : Factor w/ 5 levels "changed","identical",..: 5 3 3 3 2 2
2 2 1 2 ...
 $ change_rate: num  1 0 0 1 0 0 0 0 0 0 ...
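
To make the colClasses mechanism concrete, here is a minimal sketch on a small made-up table rather than the file above. The column names url, size and date are placeholders invented for illustration; only recovered and change_rate come from the output shown. The string "NULL" drops a column, while NA leaves the type to be guessed.

```r
# Made-up five-column, tab-separated sample (not the file from this thread).
# url, size and date are hypothetical column names.
txt <- paste(
  "url\tsize\tdate\trecovered\tchange_rate",
  "a.html\t10\t2006-01-01\tchanged\t1",
  "b.html\t20\t2006-01-02\tidentical\t0",
  "c.html\t30\t2006-01-03\tmissing\t0.5",
  sep = "\n"
)

# One entry per column in the file: "NULL" (a string) skips the column,
# NA lets read.table work out the type itself.
cc <- rep("NULL", 5)
cc[4:5] <- NA

f <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                colClasses = cc, stringsAsFactors = FALSE)
str(f)   # only the last two columns are kept
```

Skipped columns are never stored in the data frame, which is where the memory saving comes from.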

> I've already made a work-around by loading the table into mysql and
> doing a select on the 2 columns I need.  I just wonder why the above 2
> points aren't implemented.  Maybe they are and I'm totally missing it.
>
> Thanks,
> Frank
>
>
>  


--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907


Re: [R] Row limit for read.table

Martin Becker
In reply to this post by Frank McCown
Frank McCown wrote:

> I have been trying to read in a large data set using read.table, but
> I've only been able to grab the first 50,871 rows of the total 122,269 rows.
>
>  > f <-
> read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
> header=TRUE, nrows=123000, comment.char="", sep="\t")
>  > length(f$change_rate)
> [1] 50871
>
>  From searching the email archives, I believe this is due to size limits
> of a data frame.  So...
>
>  
It is not due to size limits, see below.
> 1) Why doesn't read.table give a proper warning when it doesn't place
> every read item into a data frame?
>
>  
In your case, read.table behaves as documented.
The ' character is one of the default quoting characters. Some (but
very few) of the entries contain a single ' character, so sometimes more
than ten thousand lines are just treated as a single entry. Try using
quote="" to disable quoting, as documented on the help page:

f<-read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
header=TRUE, nrows=123000, comment.char="", sep="\t",quote="")

length(f$change_rate)
[1] 122271
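
The effect can be reproduced on a small scale. The following is only a sketch on made-up data (the names and row positions are invented, not taken from the file above): two apostrophes a few rows apart are enough to merge the lines between them into one field.

```r
# Made-up data: 20 tab-separated rows, two of which contain an apostrophe.
name <- sprintf("name%02d", 1:20)
name[c(12, 17)] <- c("O'Brien", "D'Arcy")   # hypothetical entries
txt <- paste(c("id\tname", paste(1:20, name, sep = "\t")), collapse = "\n")

# Default quote = "\"'": the ' in row 12 opens a quoted string that only
# closes at the ' in row 17, so the rows in between end up in one field.
bad <- read.table(text = txt, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

# quote = "" disables quoting, so every line stays a separate row.
good <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                   stringsAsFactors = FALSE)

nrow(bad)                          # fewer than 20
nrow(good)                         # all 20 rows
bad$name[grepl("\n", bad$name)]    # the swallowed lines, as a single value
```

Comparing nrow() against an independent line count of the file is a cheap way to catch this, since the short result is otherwise easy to miss.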

> 2) Why isn't there a parameter to read.table that allows the user to
> specify which columns s/he is interested in?  This functionality would
> allow extraneous columns to be ignored which would improve memory usage.
>
>  
There is (colClasses, works as documented). Try

 > f<-read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
 + header=TRUE, nrows=123000, comment.char="",
 + sep="\t",quote="",colClasses=c("character","NULL","NULL","NULL","NULL"))
 > dim(f)
[1] 122271      1

> I've already made a work-around by loading the table into mysql and
> doing a select on the 2 columns I need.  I just wonder why the above 2
> points aren't implemented.  Maybe they are and I'm totally missing it.
>
>  
Did you read the help page?
> Thanks,
> Frank
>
>
>  
Regards,
   Martin


Re: [R] Row limit for read.table

Wladimir Eremeev
In reply to this post by Frank McCown
The problem is somewhere in the file, probably with tab characters, as removing
sep="\t" from your call does the job.

> dfr<-read.table("Tchange_rates_crawled.dat",header=TRUE)
> str(dfr)
'data.frame':   122271 obs. of  5 variables:
[skipped]
> dfr<-
read.table("Tchange_rates_crawled.dat",header=TRUE,stringsAsFactors=FALSE)
> str(dfr)
'data.frame':   122271 obs. of  5 variables:
[skipped]

R has no limitation of the kind you describe. A couple of hours ago my R
session successfully read 460000 rows of data from a text file into a data
frame using read.table.
I have also processed even larger data sets (more than 500000 rows).


Re: [R] Row limit for read.table

Frank McCown
In reply to this post by Martin Becker
> In your case, read.table behaves as documented.
> The ' character is one of the default quoting characters. Some (but
> very few) of the entries contain a single ' character, so sometimes more
> than ten thousand lines are just treated as a single entry. Try using
> quote="" to disable quoting, as documented on the help page:
>
> f<-read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
> header=TRUE, nrows=123000, comment.char="", sep="\t",quote="")
>
> length(f$change_rate)
> [1] 122271


So either adding quote="" works or removing sep="\t" (and not using
quote) works.  It seems an odd side-effect that specifying the separator
changes the default behavior of quoting (because of the ' character).  I
don't see that association made in the help file.
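
A small sketch of that side effect, on made-up data rather than the file discussed here. As far as I understand the quoting rules described in ?scan, with the default whitespace separator a quote character only counts at the start of a field, whereas with an explicit sep it can also open a quoted string in the middle of one. The names and row positions below are invented for illustration.

```r
# Made-up tab-separated data; the apostrophes sit in the middle of a field.
name <- sprintf("name%02d", 1:20)
name[c(12, 17)] <- c("O'Brien", "D'Arcy")   # hypothetical entries
txt <- paste(c("id\tname", paste(1:20, name, sep = "\t")), collapse = "\n")

# Default sep = "" splits on any run of whitespace (tabs included);
# the mid-field apostrophes are read as ordinary characters.
ws <- read.table(text = txt, header = TRUE)

# Explicit sep = "\t" with the same default quote = "\"'":
# the apostrophes now open and close a quoted string.
tab <- read.table(text = txt, header = TRUE, sep = "\t")

c(whitespace = nrow(ws), tab = nrow(tab))   # 20 rows versus fewer
```

Dropping sep only helps when no field is empty or contains spaces, so for tab-delimited files quote="" is probably the more robust of the two fixes.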


> There is (colClasses, works as documented). Try
>
> f<-read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
> + header=TRUE, nrows=123000, comment.char="",
> sep="\t",quote="",colClasses=c("character","NULL","NULL","NULL","NULL"))
>  > dim(f)
> [1] 122271      1

> Did you read the help page?

Of course I did.  For me the definition of colClasses wasn't clear...
"A vector of classes to be assumed for the columns" didn't seem to be
the same thing as "the columns you would like to be read."  I may have
made the association if the help page had contained a simple example of
using colClasses.

Thanks for the help,
Frank


--
Frank McCown
Old Dominion University
http://www.cs.odu.edu/~fmccown/
