Logistic regression with multiple imputation

Daniel Chen-2
Hi,

I am a long time SPSS user but new to R, so please bear with me if my
questions seem to be too basic for you guys.

I am trying to figure out how to analyze survey data using logistic
regression with multiple imputation.

I have survey data with about 200,000 cases, and I am trying to estimate
odds ratios for a dependent variable using 6 categorical independent variables
(dummy-coded). Approximately 10% of the cases (~20,000) have missing data
in one or more of the independent variables. The percentage missing
ranges from 0.01% to 10% across the independent variables.

My current thinking is to conduct a logistic regression with multiple
imputation, but I don't know how to do it in R. I searched the web but
couldn't find instructions or examples on how to do this. Since SPSS is
hopeless with missing data, I have to learn to do this in R. I am new to R,
so I would really appreciate it if someone could show me some examples or
tell me where to find resources.

Thank you!

Daniel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Logistic regression with multiple imputation

Jeremy Miles-2
Hi Daniel

First, newer versions of SPSS have dramatically improved their ability
to do stuff with missing data - I believe it's an additional module,
and in SPSS-world, each additional module = $$$.

Analyzing missing data is a 3-step process. First, you impute,
creating multiple datasets; then you analyze each dataset in the
conventional way; then you combine the results. There are two packages
(that I know of) for imputation: mi and mice. rseek.org
will find them for you.
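
As a rough sketch of those three steps, here is how they might look with
mice. The nhanes2 example data ships with mice; the model formula
(hyp ~ bmi + chl), m = 5, and the seed are assumptions chosen only for
illustration, not a recommendation for the survey data in question.

```r
library(mice)  # provides mice(), with() for imputed data, and pool()

# nhanes2 is a small example dataset bundled with mice (25 rows, with NAs);
# hyp is a two-level factor (no/yes), used here as the outcome.

# Step 1: impute, creating m completed datasets
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)

# Step 2: fit the same logistic regression to each completed dataset
fits <- with(imp, glm(hyp ~ bmi + chl, family = binomial))

# Step 3: combine the m sets of estimates (Rubin's rules)
pooled <- pool(fits)
pooled.summary <- summary(pooled)
pooled.summary
```

The estimates are on the log-odds scale, so exp(pooled.summary$estimate)
gives pooled odds ratios. With only 25 rows, glm() may warn about fitted
probabilities near 0 or 1; that is a property of the toy data.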

Hope that helps,

Jeremy




On 29 June 2010 22:14, Daniel Chen <[hidden email]> wrote:

> [...]



--
Jeremy Miles
Psychology Research Methods Wiki: www.researchmethodsinpsychology.com

Re: Logistic regression with multiple imputation

Simon Blomberg-4
mitools is useful too, and I can vouch for mice. mice is easy to use,
and it is easy to write new imputation methods for it, so it is also
very flexible.

Simon.

On 30/06/10 15:31, Jeremy Miles wrote:

> [...]

--
Simon Blomberg, BSc (Hons), PhD, MAppStat.
Lecturer and Consultant Statistician
School of Biological Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
http://www.uq.edu.au/~uqsblomb/

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem

Statistics is the grammar of science - Karl Pearson.

Re: Logistic regression with multiple imputation

Chuck Cleland
In reply to this post by Daniel Chen-2
On 6/30/2010 1:14 AM, Daniel Chen wrote:

> [...]

  Here is an example using the Amelia package to generate imputations
and the mitools and mix packages to make the pooled inferences.

# Read the Titanic data (comma-separated, with a header row)
titanic <- read.table("http://lib.stat.cmu.edu/S/Harrell/data/ascii/titanic.txt",
                      sep = ',', header = TRUE)

set.seed(4321)

# Introduce some additional missing values for illustration
titanic$sex[sample(nrow(titanic), 10)] <- NA
titanic$pclass[sample(nrow(titanic), 10)] <- NA
titanic$survived[sample(nrow(titanic), 10)] <- NA

library(Amelia)  # generate multiple imputations
library(mitools) # for MIextract() and MIcombine()
library(mix)     # for mi.inference()

# Create 10 imputed datasets; noms= declares the nominal variables
titanic.amelia <- amelia(subset(titanic,
                                select = c('survived','pclass','sex','age')),
                         m = 10, noms = c('survived','pclass','sex'),
                         emburn = c(500,500))

# Fit the same logistic regression to each imputed dataset
allimplogreg <- lapply(titanic.amelia$imputations,
                       function(x) glm(survived ~ pclass + sex + age,
                                       family = binomial, data = x))

# Extract coefficients and standard errors from each fit, then pool them
mice.betas.glm <- MIextract(allimplogreg, fun = function(x) coef(x))
mice.se.glm <- MIextract(allimplogreg, fun = function(x) sqrt(diag(vcov(x))))

as.data.frame(mi.inference(mice.betas.glm, mice.se.glm))

# Or using only mitools for pooled inference

betas <- MIextract(allimplogreg, fun = coef)
vars <- MIextract(allimplogreg, fun = vcov)
summary(MIcombine(betas, vars))
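
For readers wondering what the pooling step does, the arithmetic behind
it (Rubin's rules) can be sketched in base R for a single coefficient.
The estimates and standard errors below are made-up numbers used only to
show the calculation; they do not come from the Titanic fits above.

```r
# Hypothetical per-imputation results for one coefficient (m = 5);
# these numbers are invented purely to illustrate the arithmetic.
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)  # coefficient estimates
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)  # their standard errors

m     <- length(est)
qbar  <- mean(est)              # pooled point estimate
ubar  <- mean(se^2)             # within-imputation variance
b     <- var(est)               # between-imputation variance
total <- ubar + (1 + 1/m) * b   # total variance
pooled.se <- sqrt(total)

# Degrees of freedom for the reference t distribution
r  <- (1 + 1/m) * b / ubar      # relative increase in variance
df <- (m - 1) * (1 + 1/r)^2

c(estimate = qbar, se = pooled.se, df = df)
```

The pooled standard error is always at least as large as the average
within-imputation one, because the between-imputation spread is added on
top; that extra term is how the uncertainty due to missing data enters.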

--
Chuck Cleland, Ph.D.
NDRI, Inc. (www.ndri.org)
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894

Re: Logistic regression with multiple imputation

Frank Harrell
There are titanic datasets in R binary format at
http://biostat.mc.vanderbilt.edu/DataSets

Note that the aregImpute function in the Hmisc package streamlines many
of the steps, in conjunction with the fit.mult.impute function.
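
A minimal sketch of that combination, using simulated data rather than
the Titanic files: the data frame d, its variables (y, x1, x2), the
amounts of missingness, and n.impute = 5 are all assumptions made up for
illustration. Loading rms also loads Hmisc.

```r
library(rms)  # loads Hmisc as well; provides lrm()

set.seed(1)
n <- 300

# Simulated data: binary outcome y, continuous x1, 3-level factor x2
d <- data.frame(x1 = rnorm(n),
                x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x1 + 0.5 * (d$x2 == "b")))

# Make some predictor values missing
d$x1[sample(n, 30)] <- NA
d$x2[sample(n, 20)] <- NA

# Impute: every variable in the formula is used to predict the others
imp <- aregImpute(~ y + x1 + x2, data = d, n.impute = 5, pr = FALSE)

# Fit a logistic model on each completed dataset and pool the results
fit <- fit.mult.impute(y ~ x1 + x2, lrm, imp, data = d, pr = FALSE)
fit
```

Here exp(coef(fit)) gives the pooled odds ratios, and the variances
reported for fit have been adjusted for the imputation.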

Frank


On 06/30/2010 05:02 AM, Chuck Cleland wrote:

> [...]


--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University

Re: Logistic regression with multiple imputation

Rafael Björk
In reply to this post by Chuck Cleland
In addition to the tips above, you may want to check out:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

2010/6/30 Chuck Cleland <[hidden email]>

> [...]

Re: Logistic regression with multiple imputation

David Winsemius
In reply to this post by Daniel Chen-2

On Jun 30, 2010, at 1:14 AM, Daniel Chen wrote:

> [...]

The rms/Hmisc duo of packages has several functions supporting
multiple imputation. aregImpute() is nicely integrated with Harrell's
other utility functions and extensively documented in his excellent
text, "Regression Modeling Strategies". He also provides quite a bit
of free, online documentation at his Vanderbilt website. The help page
for aregImpute is a small chapter in itself, with multiple worked
examples.

install.packages(c("rms", "Hmisc"))
require(rms) # rms depends on Hmisc, which will load automagically
?aregImpute

--
David Winsemius

David Winsemius, MD
West Hartford, CT
