Handling large dataset & dataframe


Handling large dataset & dataframe

Sachin J
Hi,

I have a dataset consisting of 350,000 rows and 266 columns. Of the 266 columns, 250 are dummy-variable columns. I am trying to read this dataset into an R data frame, but I cannot because of memory limitations (the object created is too large for R to handle). Is there a way to handle such a large dataset in R?

My PC has 1 GB of RAM and 55 GB of hard-disk space, running Windows XP.

Any pointers would be of great help.

TIA
Sachin

               

Re: Handling large dataset & dataframe

RKoenker
You can read it in chunks and store it in sparse-matrix form using the SparseM or Matrix packages, but then you need to think about what you want to do with it... least-squares sorts of things are fine, but other options are somewhat limited...


url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    [hidden email]            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820
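To make the suggestion concrete, here is a minimal sketch of chunked reading into one sparse matrix with the Matrix package. The file name "data.csv", the chunk size, and the assumption that every column is numeric (0/1 dummies included) are illustrative, not taken from the thread.

library(Matrix)

con <- file("data.csv", open = "r")
chunk_rows <- 50000
chunks <- list()
repeat {
  block <- try(read.csv(con, header = FALSE, nrows = chunk_rows,
                        colClasses = "numeric"), silent = TRUE)
  if (inherits(block, "try-error") || nrow(block) == 0) break  # input exhausted
  chunks[[length(chunks) + 1]] <- Matrix(as.matrix(block), sparse = TRUE)
  if (nrow(block) < chunk_rows) break                          # short final chunk
}
close(con)

X <- do.call(rbind, chunks)   # one sparse matrix holding all rows

With X in sparse form, crossprod(X) and similar least-squares building blocks stay cheap, which is the "least-squares sorts of things are fine" part of the advice above.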



Re: Handling large dataset & dataframe

Sachin J
Hi Roger,

I want to carry out a regression analysis on this dataset, so I believe I can't read it in chunks. Is there any other solution?

TIA
Sachin
 


Re: Handling large dataset & dataframe

Gabor Grothendieck
In reply to this post by Sachin J
You just need the much smaller cross-product matrix X'X and vector X'y, so you can build those up as you read the data in chunks.
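A tiny self-contained illustration of the point, on simulated data rather than the poster's file: the fitted coefficients depend only on X'X and X'y, which stay small no matter how many rows were accumulated into them.

set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))   # intercept plus 3 predictors
y <- X %*% c(2, 1, -1, 0.5) + rnorm(100)

xtx <- crossprod(X)      # X'X: 4 x 4, however many rows X has
xty <- crossprod(X, y)   # X'y: 4 x 1

beta_hat <- solve(xtx, xty)   # agrees with coef(lm(y ~ X - 1)) up to rounding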



Re: Handling large dataset & dataframe

Richard M. Heiberger
In reply to this post by Sachin J
Where is the excess size arising? Is it in the read, or in lm()?

If it is in reading the data, why read the dummy variables at all? Would it make sense to read a single factor column instead of 80 columns of dummy variables?
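In other words, the dummy columns need never be stored: a single factor column expands into 0/1 columns only when the model is fitted. A small sketch with a made-up column:

region <- factor(c("north", "south", "east", "south", "north"))
model.matrix(~ region)   # the 0/1 dummy columns lm() would build on the fly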


Re: Handling large dataset & dataframe

Sachin J
Hi Richard:

Even if I don't read the dummy-variable columns, i.e. I just read the original dataset with 350,000 rows and 16 columns, when I try to run the regression using

> lm(y ~ c1 + factor(c2) + factor(c3))

where c2 and c3 are dummy variables, the procedure fails saying there is not enough memory. But

> lm(y ~ c1 + factor(c2))

works fine. Any thoughts?

Thanks
Sachin

"Richard M. Heiberger" <[hidden email]> wrote:
  Where is the excess size being identified? Is it the read? or in the lm().

If it is in the reading of the data, then why are you reading the dummy variables?
Would it make sense to read a single column of a factor instead of 80 columns
of dummy variables?


               
---------------------------------

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Reply | Threaded
Open this post in threaded view
|

Re: Handling large dataset & dataframe

Sachin J
In reply to this post by Gabor Grothendieck
Gabor:

Could you elaborate a bit more?

Thanks
Sachin


Re: Handling large dataset & dataframe

Liaw, Andy
In reply to this post by Sachin J
Instead of reading the entire dataset at once, read a chunk at a time, compute X'X and X'y on that chunk, and accumulate (i.e., add) them. There are examples in "S Programming", taken from independent replies by the two authors to a post on S-news, if I remember correctly.

Andy
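A minimal sketch of that accumulation, assuming a comma-separated file "data.csv" with no header row, the response in column 1, and numeric predictors in the remaining columns (file name, layout, and chunk size are illustrative only):

chunk_rows <- 50000
con <- file("data.csv", open = "r")

xtx <- NULL
xty <- NULL
repeat {
  block <- try(read.csv(con, header = FALSE, nrows = chunk_rows,
                        colClasses = "numeric"), silent = TRUE)
  if (inherits(block, "try-error") || nrow(block) == 0) break  # input exhausted
  y <- block[[1]]
  X <- cbind(1, as.matrix(block[-1]))          # prepend an intercept column
  xtx <- if (is.null(xtx)) crossprod(X)    else xtx + crossprod(X)
  xty <- if (is.null(xty)) crossprod(X, y) else xty + crossprod(X, y)
  if (nrow(block) < chunk_rows) break          # short final chunk
}
close(con)

beta_hat <- solve(xtx, xty)   # least-squares coefficients from the accumulated pieces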


Re: Handling large dataset & dataframe

Sachin J
Hi Andy:

I searched the R-help archives for how to handle a large dataset using readLines() and other related functions, but I couldn't find a single post that elaborates the process. Can you provide an example, or pointers to postings that do?

Thanks in advance
Sachin
   
 
"Liaw, Andy" <[hidden email]> wrote:
  Instead of reading the entire data in at once, you read a chunk at a time,
and compute X'X and X'y on that chunk, and accumulate (i.e., add) them.
There are examples in "S Programming", taken from independent replies by the
two authors to a post on S-news, if I remember correctly.

Andy

From: Sachin J

>
> Gabor:
>
> Can you elaborate more.
>
> Thanx
> Sachin
>
> Gabor Grothendieck wrote:
> You just need the much smaller cross product matrix X'X and
> vector X'Y so you can build those up as you read the data in
> in chunks.
>
>
> On 4/24/06, Sachin J wrote:
> > Hi,
> >
> > I have a dataset consisting of 350,000 rows and 266 columns. Out of
> > 266 columns 250 are dummy variable columns. I am trying to
> read this
> > data set into R dataframe object but unable to do it due to memory
> > size limitations (object size created is too large to
> handle in R). Is
> > there a way to handle such a large dataset in R.
> >
> > My PC has 1GB of RAM, and 55 GB harddisk space running windows XP.
> >
> > Any pointers would be of great help.
> >
> > TIA
> > Sachin
> >
> >
> > ---------------------------------
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
>
>
> ---------------------------------
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
>


------------------------------------------------------------------------------

------------------------------------------------------------------------------


               
---------------------------------

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Reply | Threaded
Open this post in threaded view
|

Re: Handling large dataset & dataframe

Mark Stephens
In reply to this post by Sachin J
Sachin,

With your dummies stored as integer, the size of your object would appear to be 350000 * (4*250 + 8*16) bytes = 376 MB. You said "PC" but did not provide R version information; assuming Windows, then...

With 1 GB of RAM you should be able to load a 376 MB object into memory. If you can store the dummies as 'raw', the object size is only 126 MB.

You don't say how you attempted to load the data. Assuming your input data are in a text file (or can be), have you tried scan()? Set up the 'what' argument with length 266 and make sure the dummy columns are set to integer() or raw(). Then x = scan(...); class(x) = "data.frame".

What is the result of memory.limit()? If it is 256 MB or 512 MB, then try starting R with --max-mem-size=800M (I forget the exact syntax). Leave a bit of room below 1 GB. Once the object is in memory, R may need to copy it once, or a few times. You may need to close all other apps in memory, or send them to swap.

I don't really see why your data should not fit into the memory you have. Purchasing an extra 1 GB may help; knowing the object-size calculation (as above) should help you gauge whether it is worth it. Have you used a process monitor to watch the memory grow as R loads the data? This can be useful.

If all the above fails, then consider 64-bit and purchasing as much memory as you can afford. R can use 64 GB+ of RAM on 64-bit machines. Maybe you can hire some time on a 64-bit server farm; I heard it's quite cheap but have never tried it myself. You shouldn't need to go that far with this dataset, though.

Hope this helps,
Mark
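For reference, the arithmetic behind those figures, plus the memory checks mentioned above (the 800M value is just the example used in this message; memory.limit() and --max-mem-size apply to Windows builds of R):

rows <- 350000
rows * (4 * 250 + 8 * 16) / 1024^2   # about 376 MB with integer dummies
rows * (1 * 250 + 8 * 16) / 1024^2   # about 126 MB with 'raw' dummies

memory.limit()                        # current limit in MB on Windows
## start R with:  Rgui.exe --max-mem-size=800M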



Re: Handling large dataset & dataframe

Sachin J
Mark:

Here is the information I didn't provide in my earlier post. The R version is 2.2.1, running on Windows XP. My dataset has 16 variables with the following data types:

ColNumber:  1  2  3  ... 16
Datatypes:
"numeric","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","character","character","numeric","numeric","numeric","numeric","numeric","numeric","numeric"

Variable (2), which is numeric, and the variables denoted as character are to be treated as dummy variables in the regression.

A search of the R-help list suggested I can use read.csv with the colClasses option instead of using scan() and then converting to a data frame as you suggested. I am trying both methods but am unable to resolve a syntax error.

> coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric","factor","numeric","numeric","factor","factor","numeric","numeric","numeric","numeric","numeric","numeric","numeric")

> mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses = coltypes, strip.white=TRUE)

ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
        scan() expected 'a real', got 'V1'

I have no idea what the problem is.

As per your suggestion, I also tried scan() as follows:

> coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric","factor","numeric","numeric","factor","factor","numeric","numeric","numeric","numeric","numeric","numeric","numeric")
> x <- scan(file = "C:/temp/data.dbf", what = as.list(coltypes), sep = ",", quiet = TRUE, skip = 1)
> names(x) <- scan(file = "C:/temp/data.dbf", what = "", nlines = 1, sep = ",")
> x <- as.data.frame(x)

This runs, but x contains no data:

> x
 [1] X._.   NA.    NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8  NA..9  NA..10 NA..11
[14] NA..12 NA..13 NA..14 NA..15 NA..16
<0 rows> (or 0-length row.names)

Please let me know how to properly use scan() or the colClasses option.

Sachin

   
   
 


Re: Handling large dataset & dataframe

Mark Stephens
From ?scan: "the *type* of what gives the type of data to be read". So: list(integer(), integer(), double(), raw(), ...). In your code, all columns are being read as character, regardless of the contents of the character vector.

I have to admit that I added the *'s in *type*; I have been caught out by this too. It's not the most convenient way to specify the types of a large number of columns, either. As you have a lot of columns, you might want to do something like as.list(rep(integer(1), 250)), assuming your dummies are together, to save typing. Also, storage.mode() is useful to tell you the precise type (and therefore size) of an object; e.g. sapply(coltypes, storage.mode) gives the types scan() will actually use. Note that 'numeric' could be 'double' or 'integer', which matters in your case for fitting inside the 1 GB limit, because 'integer' (4 bytes) is half the size of 'double' (8 bytes).

Perhaps someone on r-devel could enhance the documentation to make "type" stand out in capitals, in bold, in help(scan)? Or maybe scan could be clever enough to accept a character vector 'what'. Or maybe I'm missing a good reason why this isn't possible - anyone? How about allowing a character vector of length one, with each character representing the type of that column? For example, what="IIIIDDCD" would mean 4 integers, followed by 2 doubles, followed by a character column, followed finally by a double column, 8 columns in total. Probably someone somewhere has done that already, but I'm not aware that anyone has wrapped it up conveniently.
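A hedged sketch of what that looks like for the 18 column types listed earlier in the thread, reading the factor columns as character and converting afterwards. Note that scan() expects a plain-text file, so a CSV export is assumed here rather than the .dbf file used above.

what <- list(double(), character(), double(), double(), double(), double(),
             character(), double(), double(), character(), character(),
             double(), double(), double(), double(), double(), double(),
             double())
x <- scan("C:/temp/data.csv", what = what, sep = ",", skip = 1, quiet = TRUE)
names(x) <- paste("V", seq_along(x), sep = "")
x <- as.data.frame(x, stringsAsFactors = TRUE)   # character columns become factors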


Re: Handling large dataset & dataframe

Liaw, Andy
In reply to this post by Sachin J
Much easier to use colClasses in read.table, and in many cases just as fast
(or even faster).

Andy
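A sketch of that route, under two assumptions worth checking against the actual file: colClasses must have exactly one entry per column, and header must say whether the first line holds column names. The earlier error, "scan() expected 'a real', got 'V1'", is typically what read.csv reports when header = FALSE but the first line really does contain names such as V1, so a name gets parsed as data for a numeric column.

coltypes <- c("numeric", "factor",  "numeric", "numeric", "numeric", "numeric",
              "factor",  "numeric", "numeric", "factor",  "factor",
              "numeric", "numeric", "numeric", "numeric", "numeric",
              "numeric", "numeric")
mydf <- read.csv("C:/temp/data.csv", header = TRUE,
                 colClasses = coltypes, strip.white = TRUE)
str(mydf)   # confirm each column arrived with the intended class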


Re: Handling large dataset & dataframe

Sachin J
Mark:

Thanks for the pointers. As suggested, I will explore the scan() method.

Andy:

How can I use colClasses in my case? I tried it unsuccessfully and am encountering the following error.

coltypes <- c("numeric","factor","numeric","numeric","numeric","numeric","factor","numeric","numeric","factor","factor","numeric","numeric","numeric","numeric","numeric","numeric","numeric")

mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses = coltypes, strip.white=TRUE)

ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : scan() expected 'a real', got 'V1'

Thanks again.

Sachin
 
"Liaw, Andy" <[hidden email]> wrote:
  Much easier to use colClasses in read.table, and in many cases just as fast
(or even faster).

Andy

From: Mark Stephens

>
> From ?scan: "the *type* of what gives the type of data to be
> read". So list(integer(), integer(), double(), raw(), ...) In
> your code all columns are being read as character regardless
> of the contents of the character vector.
>
> I have to admit that I have added the *'s in *type*. I have
> been caught out by this too. Its not the most convenient way
> to specify the types of a large number of columns either. As
> you have a lot of columns you might want to do something like
> this: as.list(rep(integer(1),250)), assuming your dummies
> are together, to save typing. Also storage.mode() is useful
> to tell you the precise type (and therefore size) of an
> object e.g. sapply(coltypes,
> storage.mode) is actually the types scan() will use. Note
> that 'numeric' could be 'double' or 'integer' which are
> important in your case to fit inside the 1GB limit, because
> 'integer' (4 bytes) is half 'double' (8 bytes).
>
> Perhaps someone on r-devel could enhance the documentation to
> make "type" stand out in capitals in bold in help(scan)? Or
> maybe scan could be clever enough to accept a character
> vector 'what'. Or maybe I'm missing a good reason why this
> isn't possible - anyone? How about allowing a character
> vector length one, with each character representing the type
> of that column e.g. what="IIIIDDCD" would mean 4 integers
> followed by 2 double's followed by a character column,
> followed finally by a double column, 8 columns in total.
> Probably someone somewhere has done that already, but I'm not
> aware anyone has wrapped it up conveniently?
>
> On 25/04/06, Sachin J wrote:
> >
> > Mark:
> >
> > Here is the information I didn't provide in my earlier
> post. R version
> > is R2.2.1 running on Windows XP. My dataset has 16 variables with
> > following data type.
> > ColNumber: 1 2 3 .......16
> > Datatypes:
> >
> >
> "numeric","numeric","numeric","numeric","numeric","numeric","character
> >
> ","numeric","numeric","character","character","numeric","numeric","num
> > eric","numeric","numeric","numeric","numeric"
> >
> > Variable (2) which is numeric and variables denoted as
> character are
> > to be treated as dummy variables in the regression.
> >
> > Search in R help list suggested I can use read.csv with colClasses
> > option also instead of using scan() and then converting it to
> > dataframe as you suggested. I am trying both these methods
> but unable
> > to resolve syntactical error.
> >
> > >coltypes<-
> >
> c("numeric","factor","numeric","numeric","numeric","numeric","factor",
> >
> "numeric","numeric","factor","factor","numeric","numeric","numeric","n
> > umeric","numeric","numeric","numeric")
> >
> > >mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses =
> > >coltypes,
> > strip.white=TRUE)
> >
> > ERROR: Error in scan(file = file, what = what, sep = sep, quote =
> > quote, dec = dec, :
> > scan() expected 'a real', got 'V1'
> >
> > No idea whats the problem.
> >
> > AS PER YOUR SUGGESTION I TRIED scan() as follows:
> >
> >
> >
> >coltypes<-c("numeric","factor","numeric","numeric","numeric","numeric
> >
> >","factor","numeric","numeric","factor","factor","numeric","n
> umeric","numeric","numeric","numeric","numeric","numeric")
> > >x<-scan(file =
> "C:/temp/data.dbf",what=as.list(coltypes),sep=",",quiet=TRUE,skip=1)
> >
> > >names(x)<-scan(file = "C:/temp/data.dbf",what="",nlines=1, sep=",")
> > >x<-as.data.frame(x)
> >
> > This is working fine but x has no data in it and contains
> > > x
> >
> > [1] X._. NA. NA..1 NA..2 NA..3 NA..4 NA..5 NA..6
> NA..7 NA..8
> > NA..9 NA..10 NA..11
> > [14] NA..12 NA..13 NA..14 NA..15 NA..16
> > <0 rows> (or 0-length row.names)
> >
> > Please let me know how to properly use scan or colClasses option.
> >
> > Sachin
> >
> >
> >
> >
> >
> > *Mark Stephens * wrote:
> >
> > Sachin,
> > With your dummies stored as integer, the size of your object would
> > appear to be 350000 * (4*250 + 8*16) bytes = 376MB. You
> said "PC" but
> > did not provide R version information, assuming windows then ...
> > With 1GB RAM you should be able to load a 376MB object into
> memory. If you
> > can store the dummies as 'raw' then object size is only 126MB.
> > You don't say how you attempted to load the data. Assuming
> your input data
> > is in text file (or can be) have you tried scan()? Setup the 'what'
> > argument
> > with length 266 and make sure the dummy column are set to
> integer() or
> > raw(). Then x = scan(...); class(x)=" data.frame".
> > What is the result of memory.limit()? If it is 256MB or
> 512MB, then try
> > starting R with --max-mem-size=800M (I forget the syntax
> exactly). Leave a
> > bit of room below 1GB. Once the object is in memory R may
> need to copy it
> > once, or a few times. You may need to close all other apps
> in memory, or
> > send them to swap.
> > I don't really see why your data should not fit into the
> memory you have.
> > Purchasing an extra 1GB may help. Knowing the object size
> calculation (as
> > above) should help you guage whether it is worth it.
> > Have you used process monitor to see the memory growing as
> R loads the
> > data? This can be useful.
> > If all the above fails, then consider 64-bit and purchasing
> as much memory
> > as you can afford. R can use over 64GB RAM+ on 64bit
> machines. Maybe you
> > can
> > hire some time on a 64-bit server farm - i heard its quite
> cheap but never
> > tried it myself. You shouldn't need to go that far with
> this data set
> > though.
> > Hope this helps,
> > Mark
> >
> >
> > Hi Roger,
> >
> > I want to carry out regression analysis on this dataset. So
> I believe
> > I can't read the dataset in chunks. Any other solution?
> >
> > TIA
> > Sachin
> >
> >
> > roger koenker < [hidden email]> wrote:
> > You can read chunks of it at a time and store it in sparse
> matrix form
> > using the packages SparseM or Matrix, but then you need to
> think about
> > what you want to do with it.... least squares sorts of
> things are ok,
> > but other options are somewhat limited...
> >
> >
> > url: www.econ.uiuc.edu/~roger Roger Koenker
> > email [hidden email] Department of Economics
> > vox: 217-333-4558 University of Illinois
> > fax: 217-244-6678 Champaign, IL 61820
> >
> >
> > On Apr 24, 2006, at 12:41 PM, Sachin J wrote:
> >
> > > Hi,
> > >
> > > I have a dataset consisting of 350,000 rows and 266
> columns. Out of
> > > 266 columns 250 are dummy variable columns. I am trying
> to read this
> > > data set into R dataframe object but unable to do it due
> to memory
> > > size limitations (object size created is too large to
> handle in R).
> > > Is there a way to handle such a large dataset in R.
> > >
> > > My PC has 1GB of RAM, and 55 GB harddisk space running windows XP.
> > >
> > > Any pointers would be of great help.
> > >
> > > TIA
> > > Sachin
> > >
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> >
> http://www.R-project.org/posting-guide.html> > osting-guide.html>
> >
> >
> > ------------------------------

> calls. Great
> > rates starting at 1¢/min.
> >
> > > com/evt=39666/*http://beta.messenger.yahoo.com>
> >
> >
>
> [[alternative HTML version deleted]]
>
>

------------------------------------------------------------------------------

------------------------------------------------------------------------------


               
---------------------------------

        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html