|
Does anyone know of any X^2 tests to compare the fit of logistic models which factor out the sample size? I'm dealing with a very large sample and I fear the significant X^2 test I get when adding a variable to the model is simply a result of the sample size (>200,000 cases).

I'd rather use the whole dataset instead of taking (small) random samples as it is highly skewed. I've seen things like Phi and Cramer's V for crosstabs but I'm not sure whether they have been used before on logistic regression, if there are better ones and if there are any packages.

Many thanks

Marco

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
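To make the concern in the question above concrete, here is a minimal sketch in plain Python (standard library only) of how a Pearson X^2 statistic grows with the sample size while Phi stays put. The 2x2 counts are made up purely for illustration and are not from the poster's data; `chi2_and_phi` and `p_value_1df` are hypothetical helper names.

```python
import math


def chi2_and_phi(a, b, c, d):
    """Pearson chi-square (no continuity correction) and Phi
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    phi = math.sqrt(chi2 / n)  # Phi removes the dependence on n
    return chi2, phi


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 df.
    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: rows = binary predictor, columns = outcome (no, yes).
small = (450, 50, 440, 60)               # n = 1,000
large = tuple(200 * x for x in small)    # same proportions, n = 200,000

chi2_small, phi_small = chi2_and_phi(*small)
chi2_large, phi_large = chi2_and_phi(*large)

for label, chi2, phi in (("n=1,000", chi2_small, phi_small),
                         ("n=200,000", chi2_large, phi_large)):
    print(f"{label}: X^2={chi2:.2f}  p={p_value_1df(chi2):.3g}  Phi={phi:.4f}")
```

With identical cell proportions, X^2 scales linearly with n (so the p-value shrinks), whereas Phi is unchanged. This only covers a single 2x2 crosstab; it is not a substitute for an effect-size measure on a fitted logistic model.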
|
On Jul 31, 2012, at 10:35 AM, M Pomati <[hidden email]> wrote:

> Does anyone know of any X^2 tests to compare the fit of logistic models
> which factor out the sample size? I'm dealing with a very large sample and
> I fear the significant X^2 test I get when adding a variable to the model
> is simply a result of the sample size (>200,000 cases).
>
> I'd rather use the whole dataset instead of taking (small) random samples
> as it is highly skewed. I've seen things like Phi and Cramer's V for
> crosstabs but I'm not sure whether they have been used before on logistic
> regression, if there are better ones and if there are any packages.
>
> Many thanks
>
> Marco

Sounds like you are bordering on some type of stepwise approach to including or not including covariates in the model. You can search the list archives for a myriad of discussions as to why that is a poor approach.

You have the luxury of a large sample. You also have the challenge of interpreting covariates that appear to be statistically significant, but may have a rather small *effect size* in context. That is where subject matter experts need to provide input as to interpretation of the contextual significance of the variable, as opposed to the statistical significance of that same variable.

A general approach is to simply pre-specify your model based upon rather simple considerations. Also, you need to determine if your goal for the model is prediction or explanation.

What is the incidence of your 'event' in the sample? If it is, say, 10%, then you should have around 20,000 events. The rule of thumb for logistic regression is to have around 20 events per covariate degree of freedom (df) to minimize the risk of over-fitting the model to your dataset. A continuous covariate is 1 df, a k-level factor is k-1 df. So with 20,000 events, your model could feasibly have 1,000 covariate df's. I am guessing that you don't have that much independent data to begin with.

So, pre-specify your model on the full dataset and stick with it. Interact with subject matter experts on the interpretation of the model.

BTW, this question is really about statistical modeling generally, not really R specific. Such queries are best posed to general statistical lists/forums such as Stack Exchange. I would also point you to Frank Harrell's book, Regression Modeling Strategies.

Regards,

Marc Schwartz |
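The events-per-df arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb as stated there; the function names and the example covariate list are hypothetical, and taking the smaller of events and non-events as the limiting count is an added assumption that makes no difference at a 10% incidence.

```python
def covariate_df(covariates):
    """Total covariate df: a continuous covariate counts as 1 df,
    a k-level factor as k - 1 df.
    `covariates` maps a name to None (continuous) or a number of levels k."""
    return sum(1 if k is None else k - 1 for k in covariates.values())


def max_covariate_df(n_cases, incidence, events_per_df=20):
    """Rough upper bound on covariate df under the ~20 events per df rule.
    Assumption: the limiting count is the smaller of events and non-events."""
    events = round(n_cases * incidence)
    limiting = min(events, n_cases - events)
    return limiting // events_per_df


# Figures from the reply: 200,000 cases at a 10% incidence.
budget = max_covariate_df(200_000, 0.10)

# Hypothetical pre-specified model: two continuous covariates
# and one factor with 12 levels.
model = {"age": None, "income": None, "region": 12}
needed = covariate_df(model)

print(f"df budget: {budget}, df used by model: {needed}")
```

A pre-specified model whose df total sits far below the budget, as here, is in the situation the reply describes: over-fitting is unlikely to be the problem, and interpretation of effect sizes becomes the main task.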
|
Marc, thank you very much for your help.
I've posted it on <http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models> and added details.

Many thanks

Marco

----------------------
M Pomati
University of Bristol
School for Policy Studies
8 Priory Road
Office: 10B
Bristol BS8 1TZ, UK
http://www.bristol.ac.uk/sps/research/centres/poverty |
|
On Jul 31, 2012, at 10:25 AM, M Pomati wrote:

> Marc, thank you very much for your help.
> I've posted it on
> <http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models>
> and added details.

I think you might have gotten a more statistically knowledgeable audience at: http://stats.stackexchange.com/

(And I suggested to the moderators at math-SE that it be migrated.)

--
David.

David Winsemius, MD
Alameda, CA, USA |
