Thanks for doing this, Thomas. I have been thinking about what it would take to do this, but if it had been left to me it would have taken a lot longer.

Back in the 1980s there was a statistical package called RUMMAGE that did all computations based on sufficient statistics and did not keep the actual data in memory. Memory for computers became cheap before datasets turned huge, so there wasn't much demand for the program (and it never had a nice GUI to help make it popular). It looks like things are switching back to that model now, though.

Here are a couple of thoughts I had that might help with some future development:

Another function that could be helpful is bigplot, which I imagine would be best based on the hexbin package, just accumulating the counts in chunks like your biglm function does. Once I see the code for biglm I may be able to contribute this piece. I guess bigbarplot and bigboxplot may also be useful; accumulating counts for the barplot will be easy (a rough sketch is below), but does anyone have ideas on the best way to get quantiles for the boxplots efficiently? The best approach I can think of so far is to have the database sort the variables, but sorting tends to be slow.
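To illustrate the count-accumulation part for bigbarplot, here is a minimal sketch in base R. The names get.chunk(), n.chunks, the grouping column grp, and the level set lev are all hypothetical stand-ins for however the data actually arrive:

## Accumulate barplot counts one chunk at a time.
## 'get.chunk(i)' stands in for whatever reads the i-th chunk (hypothetical).
lev <- c("A", "B", "C")                  # assumed set of categories
counts <- numeric(length(lev))
names(counts) <- lev
for (i in 1:n.chunks) {
    chunk <- get.chunk(i)                # e.g. a data frame from the database
    ## force the full level set so the per-chunk tables always line up
    counts <- counts + table(factor(chunk$grp, levels = lev))
}
barplot(counts)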

Another general approach I thought of would be to read the data in chunks, compute the statistic(s) of interest on each chunk (e.g. the vector of coefficients for a regression model), then average the estimates across chunks. Each chunk could be treated as a cluster in a cluster sample for the averaging and for estimating variances of the estimates (if only we could get the author of the survey package involved :-). This would probably be less accurate than your biglm function for regression, but it would have the flavor of the bootstrapping routines in that it would work for many cases that don't have their own big methods written yet (logistic and other glm models, correlations, ...). A rough sketch of the idea is below.
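A minimal sketch of that idea for a logistic regression, treating the chunks as independent replicates (a proper cluster-sample variance would come from something like the survey package). The names get.chunk(), n.chunks, and the variables y, x1, x2 are hypothetical:

## Average glm coefficients across chunks and use the between-chunk
## spread to estimate the variance of the averaged estimate.
coefs <- NULL
for (i in 1:n.chunks) {
    chunk <- get.chunk(i)                       # hypothetical chunk reader
    fit <- glm(y ~ x1 + x2, family = binomial, data = chunk)
    coefs <- rbind(coefs, coef(fit))
}
est <- colMeans(coefs)                          # averaged coefficients
vcov.est <- var(coefs) / nrow(coefs)            # variance of the chunk mean
cbind(estimate = est, se = sqrt(diag(vcov.est)))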

Any other thoughts anyone?

--

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
(801) 408-8111

-----Original Message-----

From: [hidden email] [mailto:[hidden email]] On Behalf Of Thomas Lumley

Sent: Tuesday, May 16, 2006 3:40 PM

To: roger koenker

Cc: r-help list; Robert Citek

Subject: Re: [R] Re : Large database help

On Tue, 16 May 2006, roger koenker wrote:

> In ancient times, 1999 or so, Alvaro Novo and I experimented with an

> interface to mysql that brought chunks of data into R and accumulated

> results.

> This is still described and available on the web in its original form

> at

>

>

> http://www.econ.uiuc.edu/~roger/research/rq/LM.html

> Despite claims of "future developments" nothing emerged, so anyone

> considering further explorations with it may need training in

> Rchaeology.

A few hours ago I submitted to CRAN a package "biglm" that does large linear regression models using a similar strategy (it uses incremental QR decomposition rather than accumulating the crossproduct matrix). It also computes the Huber/White sandwich variance estimate in the same single pass over the data.

Assuming I haven't messed up the package checking, it will appear in the next couple of days on CRAN. The syntax looks like

a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)
summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.
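The chunks could come from anywhere; for instance, a hypothetical sketch that reads a flat file trees.csv in blocks and feeds each block to update() (the file name, column order, and chunk size are all assumptions, not part of the package):

## Fit biglm incrementally from a file read in blocks of 10000 rows.
library(biglm)
chunk.size <- 10000
cols <- c("Girth", "Height", "Volume")        # assumed column order in trees.csv
ff <- log(Volume) ~ log(Girth) + log(Height)

first <- read.table("trees.csv", header = TRUE, sep = ",", nrows = chunk.size)
a <- biglm(ff, first)
skip <- chunk.size + 1                        # +1 skips the header line
repeat {
    chunk <- try(read.table("trees.csv", header = FALSE, sep = ",",
                            skip = skip, nrows = chunk.size, col.names = cols),
                 silent = TRUE)
    if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
    a <- update(a, chunk)
    skip <- skip + chunk.size
}
summary(a)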

-thomas

______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
