Guidance on step() with large dataset (750K) solicited...

Guidance on step() with large dataset (750K) solicited...

Jeffrey S. Racine
Hi.

Background - I am working with a dataset involving around 750K
observations, where many of the variables (8/11) are unordered factors.

The model typically used for this relationship in the literature has
been a simple linear additive model, but this is rejected out of hand by
the data. I was asked to model this via kernel methods, but first wanted
to play with the parametric specification out of curiosity.

I thought it would be interesting to see what type of model stepwise BIC
would yield, and have been playing with the step() function (on R-beta
due to the factor.scope() problem that has been fixed in the patched and
beta versions).

I am running this on a 64-bit box with 32GB of RAM and tons of swap, but
am hitting the memory wall as memory needs occasionally grow to ungodly
proportions (in the early iterations the program starts out around 8GB
but quickly grows to 15GB, then grows from there). This is not due to my
using the beta version, as it also arises under R-2.2.1, for what that
is worth.
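
A rough way to see where figures of this size can come from, assuming the dominant cost is the dense double-precision design matrix built by lm(). The column counts and the toy data frame below are illustrative assumptions, not taken from the actual data; lm() also keeps a QR decomposition of the same dimensions, so the peak is likely a multiple of a single matrix.

```r
## Dense design matrix: one 8-byte double per cell.
n <- 745466
dense_gib <- function(n, p) n * p * 8 / 2^30

## Hypothetical column counts; the real p depends on the factor levels.
sapply(c(100, 1000, 5000), function(p) dense_gib(n, p))

## Second-order interactions multiply the column count quickly.
## Toy example: a 3-level factor, a 4-level factor, one numeric variable.
toy <- expand.grid(f1 = factor(1:3), f2 = factor(1:4), x = c(0.5, 1.5))
p_main  <- ncol(model.matrix(~ f1 + f2 + x, toy))
p_inter <- ncol(model.matrix(~ (f1 + f2 + x)^2, toy))
c(main = p_main, with_interactions = p_inter)
```

Applying the same two calls to a small subsample of the real data would give the actual column count for a candidate formula before fitting it on all rows.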

My question is whether or not there is some simple way to substantially
reduce the memory footprint for this procedure. I took a look at
previous posts for step() and memory issues, but still wonder whether
there might be a switch or possibly better way of constructing my model
that would overcome the memory issues.

I include the code below, and any comments or suggestions would be most
welcome (besides `what type of idiot lets information criteria determine
their model ;-)')

Thanks ever so much in advance.

-- Jeff

---- Begin ----

## Read in the full data set (n=745466 observations)

data <- read.table("../data_header.dat",header=TRUE)

## Create a data frame with all categorical variables declared as
## unordered factors

data <- data.frame(logrprice=data$logrprice,
                   cgt=factor(data$cgt),
                   cag=factor(data$cag),
                   gstann=factor(data$gstann),
                   fhogann=factor(data$fhogann),
                   gstfhog=factor(data$gstfhog),
                   luc=factor(data$luc),
                   municipality=factor(data$municipality),
                   time=factor(data$time),
                   distance=data$distance,
                   logr=data$logr,
                   loginc=data$loginc)

## Estimate a simple linear model (used repeatedly in the literature;
## fails even the simplest model specification tests, e.g.
## resettest() from the lmtest package)

model.linear <- lm(logrprice~.,data=data)

## Now conduct stepwise (BIC) regression using the step() function in
## the stats package. The lower model is the unconditional mean of y;
## the upper model has polynomials of up to order 6 in the three
## continuous covariates, with interactions among all variables of
## order 2. Setting k=log(n) makes step() use the BIC penalty.

n <- nrow(data)

model.bic <- step(model.linear,
                  scope=list(
                    lower=~ 1,
                    upper=~ (.
                             +I(logr^2)
                             +I(logr^3)
                             +I(logr^4)
                             +I(logr^5)
                             +I(logr^6)
                             +I(distance^2)
                             +I(distance^3)
                             +I(distance^4)
                             +I(distance^5)
                             +I(distance^6)
                             +I(loginc^2)
                             +I(loginc^3)
                             +I(loginc^4)
                             +I(loginc^5)
                             +I(loginc^6))
                    ^2),
                  trace=TRUE,
                  k=log(n)
                  )

summary(model.bic)

---- End ----
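
The fifteen I() terms in the upper scope above could also be generated rather than typed out. This is only a sketch: the names `vars`, `poly_terms`, `upper` and `labs` are introduced here for illustration, and the check uses a stand-in formula containing just the three continuous covariates.

```r
## Build I(v^2), ..., I(v^6) for each continuous covariate.
vars <- c("logr", "distance", "loginc")
poly_terms <- unlist(lapply(vars, function(v) sprintf("I(%s^%d)", v, 2:6)))

## Same shape as the hand-written upper scope: (. + polynomials)^2
upper <- as.formula(paste("~ (. +",
                          paste(poly_terms, collapse = " + "),
                          ")^2"))
upper

## Expand against a stand-in model formula to count candidate terms.
labs <- attr(terms(update(logrprice ~ logr + distance + loginc, upper)),
             "term.labels")
length(labs)
```

The resulting `upper` could then be passed as `scope=list(lower=~ 1, upper=upper)`. Counting the term labels first gives a sense of how many candidate terms step() may have to consider.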
--
Professor J. S. Racine         Phone:  (905) 525 9140 x 23825
Department of Economics        FAX:    (905) 521-8232
McMaster University            e-mail: [hidden email]
1280 Main St. W.,Hamilton,     URL:
http://www.economics.mcmaster.ca/racine/
Ontario, Canada. L8S 4M4

`The generation of random numbers is too important to be left to
chance.'

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: Guidance on step() with large dataset (750K) solicited...

RKoenker
Jeff,

I don't know whether this is likely to be feasible, but if you could
replace calls to lm() with calls to a sparse matrix version of lm(),
either slm() in SparseM or something similar in Matrix, then I
would think that you should be safe from memory problems.  Adapting step()
might be more than you really bargained for though; I don't
know the code....
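
To illustrate why a sparse design matrix helps when most columns are factor dummies, here is a minimal sketch using sparse.model.matrix() from the Matrix package on simulated data. The data frame `d` and its variables are made up for the example, and the normal-equations solve is only an illustration; slm() in SparseM has its own implementation.

```r
library(Matrix)

set.seed(1)
n <- 20000
d <- data.frame(g = factor(sample(200, n, replace = TRUE)),
                x = rnorm(n))
d$y <- 1 + 0.5 * d$x + rnorm(n)

Xd <- model.matrix(~ g + x, d)         # dense: n x p doubles
Xs <- sparse.model.matrix(~ g + x, d)  # sparse: stores non-zeros only

## How many times larger the dense matrix is
as.numeric(object.size(Xd)) / as.numeric(object.size(Xs))

## Least squares via the sparse normal equations (illustration only)
beta <- solve(crossprod(Xs), crossprod(Xs, d$y))

## Agrees with lm() on this small problem
fit <- lm(y ~ g + x, data = d)
all.equal(as.vector(as.matrix(beta)), unname(coef(fit)), tolerance = 1e-6)
```

Each row of a dummy-coded factor has a single non-zero entry, so the sparse form grows with the number of rows rather than rows times levels.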

Roger

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    [hidden email]            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820


On Apr 13, 2006, at 2:41 PM, Jeffrey Racine wrote:

> [snip]


Re: Guidance on step() with large dataset (750K) solicited...

Brian Ripley
On Thu, 13 Apr 2006, roger koenker wrote:

> Jeff,
>
> I don't know whether this is likely to be feasible, but if you could
> replace calls to lm() with calls to a sparse matrix version of lm(),
> either slm() in SparseM or something similar in Matrix, then I
> would think that you should be safe from memory problems.  Adapting step()
> might be more than you really bargained for though; I don't
> know the code....

step() is a simple wrapper that has been used with many model-fitting classes.
All you need is an extractAIC method.
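
As a sketch of what such a method might look like: the class name `sparse_fit` and its fields `rss`, `n` and `rank` are hypothetical, chosen only for illustration. The method returns the same two numbers as the lm method, the equivalent degrees of freedom and the penalised criterion. step() also calls update() and terms() on the fitted object, which this sketch does not provide.

```r
## extractAIC method for a hypothetical fit object that stores only
## the residual sum of squares, the sample size and the model rank.
extractAIC.sparse_fit <- function(fit, scale = 0, k = 2, ...) {
  n   <- fit$n
  edf <- fit$rank
  dev <- if (scale > 0) fit$rss / scale - n else n * log(fit$rss / n)
  c(edf, dev + k * edf)
}

## Check against the lm method on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
obj <- structure(list(rss  = sum(resid(fit)^2),
                      n    = nrow(mtcars),
                      rank = fit$rank),
                 class = "sparse_fit")

k_bic <- log(nrow(mtcars))  # BIC penalty, as in k=log(n)
rbind(custom = extractAIC(obj, k = k_bic),
      lm     = extractAIC(fit, k = k_bic))
```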

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
