Linear models over large datasets


Linear models over large datasets

Alp Atıcı
I'd like to fit linear models on very large datasets. My data frames
are about 2,000,000 rows x 200 columns of doubles, and I am using a
64-bit build of R. I've googled about this extensively and went over
the "R Data Import/Export" guide. My primary issue is that although my
data in ASCII form is about 4 GB (and therefore considerably smaller in
binary), R consumes about 12 GB of virtual memory.

What exactly are my options to improve this? I looked into the biglm
package, but the problem with it is that it uses the update() function
and is therefore not transparent (I am using a sophisticated script
which is hard to modify). I really liked the concept behind the LM
package described here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
but it is no longer available. How could one fit linear models to very
large datasets without loading the entire set into memory, reading
instead from a file or database (possibly through a connection), with a
relatively simple modification of the standard lm()? Alternatively, how
could one improve R's memory usage on a large dataset (by changing some
default parameters of R, or even by using on-the-fly compression)? I
don't mind much higher CPU time.

Thank you in advance for your help.


Re: Linear models over large datasets

Gregory Snow
Here are a couple of options that you could look at:

The biglm package also has the bigglm function, which you call only
once (no update()); you just need to give it a function that reads the
data in chunks for you. Using bigglm with a gaussian family is
equivalent to lm.
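As a rough sketch of what such a chunk-reading function might look like
(assuming a CSV file, here the placeholder "bigdata.csv", a placeholder
formula, and the reset/next-chunk convention that bigglm expects of a
data function):

library(biglm)

make_chunk_reader <- function(filename, chunksize = 50000) {
    hdr <- names(read.csv(filename, nrows = 1))   # column names from the header
    con <- NULL
    function(reset = FALSE) {
        if (reset) {                      # (re)open the file, skip the header row
            if (!is.null(con)) close(con)
            con <<- file(filename, open = "r")
            readLines(con, n = 1)
            return(invisible(NULL))
        }
        dat <- tryCatch(read.csv(con, header = FALSE, nrows = chunksize,
                                 col.names = hdr),
                        error = function(e) NULL)  # read.csv errors at end of file
        if (is.null(dat) || nrow(dat) == 0) NULL else dat
    }
}

## gaussian() makes bigglm() equivalent to an ordinary lm() fit
fit <- bigglm(y ~ x1 + x2, data = make_chunk_reader("bigdata.csv"),
              family = gaussian())
summary(fit)

bigglm makes several passes over the data, resetting the reader between
passes, so only one chunk is ever held in memory at a time.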

You could also write your own wrapper function that calls biglm and
then makes the necessary update() calls for you, and just call that
wrapper from your script.
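For instance, a minimal wrapper along these lines (again assuming a CSV
file and placeholder names for the file, formula, and chunk size) would
hide the update() calls from the calling script:

library(biglm)

chunked_biglm <- function(formula, filename, chunksize = 100000) {
    hdr <- names(read.csv(filename, nrows = 1))
    con <- file(filename, open = "r")
    on.exit(close(con))
    readLines(con, n = 1)                       # skip the header line
    next_chunk <- function() {
        dat <- tryCatch(read.csv(con, header = FALSE, nrows = chunksize,
                                 col.names = hdr),
                        error = function(e) NULL)
        if (is.null(dat) || nrow(dat) == 0) NULL else dat
    }
    fit <- biglm(formula, data = next_chunk())  # first chunk starts the fit
    while (!is.null(dat <- next_chunk()))
        fit <- update(fit, dat)                 # fold in the remaining chunks
    fit
}

## fit <- chunked_biglm(y ~ x1 + x2, "bigdata.csv")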

The SQLiteDF package has an sdflm function that uses the same internal
code as biglm but works from data stored in an SQLite database. You
don't need to call update() with this function.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
(801) 408-8111
 
 



Re: Linear models over large datasets

dlakelan
In reply to this post by Alp Atıcı

One option is to simply buy more memory, which might work for you in
this case but does not scale to larger problems.

I'm not sure how to make R happier with handling large datasets, but
you may be able to use the power of random sampling to help you?

Read the data from MySQL, selecting a random 10% subset. This should
use about 1.2 GB. You then fit the model to this subset. Repeat the
procedure 100 times using independent samples. Now you have
bootstrapped the coefficients of your model. Use the average value and
standard deviation of the coefficients as your coefficient estimates
and standard errors?

Since swapping is typically a thousand times slower (or more) than
access to RAM, this process might take a tenth of the time or less
compared with letting the R process thrash the disk through swapping.

It's a thought, not sure how well it works.
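As a rough sketch of that idea (the database name, table name, and
formula below are purely hypothetical, and the pooled standard
deviation is only an informal stand-in for a proper bootstrap
variance):

library(DBI)
library(RMySQL)                                # assumes a configured MySQL server

con  <- dbConnect(MySQL(), dbname = "mydb")    # hypothetical database
nrep <- 100
coefs <- NULL
for (i in seq_len(nrep)) {
    ## MySQL's RAND() draws an approximate 10% simple random sample of rows
    sub <- dbGetQuery(con, "SELECT * FROM mytable WHERE RAND() < 0.10")
    fit <- lm(y ~ ., data = sub)               # placeholder formula
    coefs <- rbind(coefs, coef(fit))
}
dbDisconnect(con)

## Pool across replicates: mean as the point estimate, spread as a rough
## standard error, as suggested above
cbind(estimate = colMeans(coefs), std.error = apply(coefs, 2, sd))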

--
Daniel Lakeland
[hidden email]
http://www.street-artists.org/~dlakelan


Re: Linear models over large datasets

Gabor Grothendieck
In reply to this post by Alp Atıcı
It's actually only a few lines of code to do this from first principles.
The coefficients depend only on the cross products X'X and X'y, and you
can build these up easily by extending this example to read x and y from
files or a database instead of taking them from the arguments.
Here we process incr rows of the built-in matrix state.x77 at a time,
accumulating the two cross products, xtx and xty, and regressing
Income (variable 2) on the other variables:

mylm <- function(x, y, incr = 25) {
        start <- xtx <- xty <- 0
        while(start < nrow(x)) {
            idx <- seq(start + 1, min(start + incr, nrow(x)))
            x1 <- cbind(1, x[idx,])
            xtx <- xtx + crossprod(x1)
            xty <- xty + crossprod(x1, y[idx])
            start <- start + incr
        }
        solve(xtx, xty)
}

mylm(state.x77[,-2], state.x77[,2])




Linear models over large datasets

dave fournier
In reply to this post by Alp Atıcı
 > It's actually only a few lines of code to do this from first principles.
 > The coefficients depend only on the cross products X'X and X'y [...]
If your design matrix X is very well behaved, this approach may work for
you. Often, though, solving the normal equations directly via
solve(X'X, X'y) will fail for numerical reasons. The right way to do it
is to factor the matrix X as

           X = A * B

where B is 200 x 200 in your case and A is 2000000 x 200 with

   A'*A = I   (that is, A has orthonormal columns),

so X'*X = B'*B and you solve

     B'*B * beta = X'*y    (equivalently, B * beta = A'*y).

To find A and B you can use modified Gram-Schmidt, which is very easy to
program and works well when you wish to store the columns of X on a hard
disk and just read in a bit at a time. Some people claim that modified
Gram-Schmidt is unstable, but it has always worked well for me.
In any event you can check afterwards that A'*A = I and X = A*B.
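As a small in-memory sketch of this factorization (A with orthonormal
columns, B upper triangular) and the resulting triangular solve,
checked against lm() on the state.x77 example used earlier; for a truly
large X the same column operations would be done on blocks read from
disk:

mgs_qr <- function(X) {
    ## modified Gram-Schmidt: returns A (orthonormal columns) and
    ## upper-triangular B with X = A %*% B
    p <- ncol(X)
    A <- X
    B <- matrix(0, p, p)
    for (j in seq_len(p)) {
        B[j, j] <- sqrt(sum(A[, j]^2))
        A[, j]  <- A[, j] / B[j, j]
        if (j < p) {
            for (k in (j + 1):p) {
                B[j, k] <- sum(A[, j] * A[, k])    # project remaining columns on A[, j]
                A[, k]  <- A[, k] - B[j, k] * A[, j]
            }
        }
    }
    list(A = A, B = B)
}

X <- cbind(1, state.x77[, -2])               # intercept plus the other variables
y <- state.x77[, 2]                          # Income
f <- mgs_qr(X)
beta <- backsolve(f$B, crossprod(f$A, y))    # solve B %*% beta = A'y
all.equal(unname(beta[, 1]), unname(coef(lm(y ~ X - 1))))   # should be TRUE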

       Cheers,

        Dave

--
David A. Fournier
P.O. Box 2040,
Sidney, B.C. V8l 3S3
Canada
Phone/FAX 250-655-3364
http://otter-rsch.com


Re: Linear models over large datasets

Ravi Varadhan
The simplest trick is to use the QR decomposition:

The OLS solution (X'X)^{-1}X'y can be easily computed as:
qr.solve(X, y)

Here is an illustration:

> set.seed(123)
> X <- matrix(round(rnorm(100),1),20,5)
> b <- c(1,1,2,2,3)
> y <- X %*% b + rnorm(20)
>
> ans1 <- solve(t(X)%*%X,t(X)%*%y)
> ans2 <- qr.solve(X,y)
> all.equal(ans1,ans2)
[1] TRUE

Ravi.
----------------------------------------------------------------------------

Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: [hidden email]
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

----------------------------------------------------------------------------




Re: Linear models over large datasets

dlakelan
On Fri, Aug 17, 2007 at 01:53:25PM -0400, Ravi Varadhan wrote:
> The simplest trick is to use the QR decomposition:
>
> The OLS solution (X'X)^{-1}X'y can be easily computed as:
> qr.solve(X, y)

While I agree that this is the correct way to solve the linear algebra
problem, I seem to be missing how re-implementing the existing lm
function (which undoubtedly uses a QR decomposition internally) solves
the problem that was mentioned, namely the massive amount of memory
that the process consumes.

2e6 rows by 200 columns by 8 bytes per double is about 3 GB of memory
just to hold the matrix. The QR decomposition, or any other solving
process, will at least double this to 6 GB, and it would be
unsurprising for overhead to push peak memory usage to around 8 GB.
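For reference, the arithmetic behind those figures:

n <- 2e6; p <- 200
n * p * 8 / 2^30        # about 3 GB just to hold the numeric matrix
2 * n * p * 8 / 2^30    # about 6 GB once a QR (or similar) working copy exists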

I'm going to assume that the original user has perhaps 1.5 to 2 GB
available, so any process that even READS IN a matrix of more than
about 1 million rows will exceed the available memory. Hence my
suggestion to randomly downsample the matrix by a factor of 10 and then
bootstrap the coefficients by repeating the downsampling 20, 50, or 100
times, so that all of the data are eventually used.

Now that I'm aware of the biglm package, I think that it is probably
preferable.

--
Daniel Lakeland
[hidden email]
http://www.street-artists.org/~dlakelan


Re: Linear models over large datasets

cberry
In reply to this post by Ravi Varadhan

The original complaint

> > What exactly are my options to improve this? I looked into the biglm
> > package but the problem with it is it uses update() function and is
> > therefore not transparent

is not warranted.

As usual, the source code is the best reference. It took about a minute
to download biglm_0.4.tar.gz, open it in emacs, and browse through it to
find this reference:

  ALGORITHM AS274  APPL. STATIST. (1992) VOL.41, NO. 2

in biglm/src/boundedQRf.f, which appears to incrementally update an
orthogonal decomposition of the design matrix, etc.

This seems VERY transparent.

It would seem quite easy to borrow the Fortran code and the wrappers
that biglm provides and adapt them to some other purpose.

Chuck



Charles C. Berry                            (858) 534-2098
Dept of Family/Preventive Medicine, UC San Diego
La Jolla, San Diego 92093-0901
E mailto:[hidden email]
http://famprevmed.ucsd.edu/faculty/cberry/
