The simplest trick is to use the QR decomposition:

The OLS solution (X'X)^{-1}X'y can be easily computed as:

qr.solve(X, y)

Here is an illustration:

> set.seed(123)

> X <- matrix(round(rnorm(100),1),20,5)

> b <- c(1,1,2,2,3)

> y <- X %*% b + rnorm(20)

>

> ans1 <- solve(t(X)%*%X,t(X)%*%y)

> ans2 <- qr.solve(X,y)

> all.equal(ans1,ans2)

[1] TRUE

Ravi.

----------------------------------------------------------------------------

-------

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email:

[hidden email]
Webpage:

http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
----------------------------------------------------------------------------

--------

-----Original Message-----

From:

[hidden email]
[mailto:

[hidden email]] On Behalf Of dave fournier

Sent: Friday, August 17, 2007 12:43 PM

To:

[hidden email]
Subject: [R] Linear models over large datasets

>Its actually only a few lines of code to do this from first principles.

>The coefficients depend only on the cross products X'X and X'y and you

>can build them up easily by extending this example to read files or

>a database holding x and y instead of getting them from the args.

>Here we process incr rows of builtin matrix state.x77 at a time

>building up the two cross productxts, xtx and xty, regressing

>Income (variable 2) on the other variables:

>mylm <- function(x, y, incr = 25) {

> start <- xtx <- xty <- 0

> while(start < nrow(x)) {

> idx <- seq(start + 1, min(start + incr, nrow(x)))

> x1 <- cbind(1, x[idx,])

> xtx <- xtx + crossprod(x1)

> xty <- xty + crossprod(x1, y[idx])

> start <- start + incr

> }

> solve(xtx, xty)

>}

>mylm(state.x77[,-2], state.x77[,2])

>On 8/16/07, Alp ATICI <alpatici at gmail.com> wrote:

> I'd like to fit linear models on very large datasets. My data frames

> are about 2000000 rows x 200 columns of doubles and I am using an 64

> bit build of R. I've googled about this extensively and went over the

> "R Data Import/Export" guide. My primary issue is although my data

> represented in ascii form is 4Gb in size (therefore much smaller

> considered in binary), R consumes about 12Gb of virtual memory.

>

> What exactly are my options to improve this? I looked into the biglm

> package but the problem with it is it uses update() function and is

> therefore not transparent (I am using a sophisticated script which is

> hard to modify). I really liked the concept behind the LM package

> here:

http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html > But it is no longer available. How could one fit linear models to very

> large datasets without loading the entire set into memory but from a

> file/database (possibly through a connection) using a relatively

> simple modification of standard lm()? Alternatively how could one

> improve the memory usage of R given a large dataset (by changing some

> default parameters of R or even using on-the-fly compression)? I don't

> mind much higher levels of CPU time required.

>

> Thank you in advance for your help.

>

> ______________________________________________

> R-help at stat.math.ethz.ch mailing list

>

https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.

>

If your design matrix X is very well behaved this approach may work for

you. Often just doing solve(X'X,y) will fail for numerical reasons. The

right way to do it is tofactor the matrix X as

X = A * B

where B is 200x200 in your case and A is 2000000 x 200 with

A'*A = I (That is A is orthogonal.)

so X'*X = B' *B and you use

solve(B'*B,y);

To find A and B you can use modified Gram-Schmidt which is very easy to

program and works well when you wish to store the columns of X on a hard

disk and just read in a bit at a time. Some people claim that modifed

Gram-Schmidt is unstable but it has always worked well for me.

In any event you can check to ensure that A'*A = I and

X=A*B

Cheers,

Dave

--

David A. Fournier

P.O. Box 2040,

Sidney, B.C. V8l 3S3

Canada

Phone/FAX 250-655-3364

http://otter-rsch.com______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide

http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.

______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide

http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.