

I would like to test in R which regression fits my data best. My dependent
variable is a count, and has a lot of zeros.
And I would need some help to determine which model and family to use
(poisson or quasipoisson, or zero-inflated Poisson regression), and how to
test the assumptions.
1) Poisson Regression: as far as I understand, the strong assumption is
that dependent variable mean = variance. How do you test this? How close
together do they have to be? Are unconditional or conditional mean and
variance used for this? What do I do if this assumption does not hold?
2) I read that if variance is greater than mean we have overdispersion, and
a potential way to deal with this is including more independent variables,
or family = quasipoisson. Does this distribution have any other requirements
or assumptions? Which test do I use to see whether 1) or 2) fits better -
simply anova(m1, m2)?
3) I also read that the negative binomial distribution can be used when
overdispersion appears. How do I do this in R? What is the difference from
quasipoisson?
4) Zero-inflated Poisson regression: I read that the Vuong test checks
which model fits better.
> vuong(model.poisson, model.zero.poisson)
Is that correct?
5) ats.ucla.edu has a section about zero-inflated Poisson regressions, and
tests the zero-inflated model (a) against the standard Poisson model (b):
> m.a <- zeroinfl(count ~ child + camper | persons, data = zinb)
> m.b <- glm(count ~ child + camper, family = poisson, data = zinb)
> vuong(m.a, m.b)
I don't understand what the "| persons" part of the first model does, and
why you can compare these models at all. I had expected the regression to
be the same and just use a different family.
Thank you
T
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


On Sun, 14 Oct 2012, Eiko Fried wrote:
> I would like to test in R what regression fits my data best. My dependent
> variable is a count, and has a lot of zeros.
>
> And I would need some help to determine what model and family to use
> (poisson or quasipoisson, or zeroinflated poisson regression), and how to
> test the assumptions.
>
> 1) Poisson Regression: as far as I understand, the strong assumption is
> that dependent variable mean = variance. How do you test this? How close
> together do they have to be? Are unconditional or conditional mean and
> variance used for this? What do I do if this assumption does not hold?
There are various formal tests for this, e.g., dispersiontest() in package
"AER". Alternatively, you can use a simple likelihood-ratio test (e.g., by
means of lrtest() in "lmtest") between a Poisson and negative binomial
(NB) fit. The p-value can even be halved because the Poisson lies on the
border of the NB theta parameter range (theta = infinity).
However, overdispersion can already matter before it is detected by a
significance test. Hence, if in doubt, I would simply use an NB model and
you're on the safe side. And if the NB's estimated theta parameter turns
out to be extremely large (say beyond 20 or 30), then you can still switch
back to Poisson if you want.
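The likelihood-ratio comparison can be sketched with base R plus MASS
(which ships with R); the quine data and formula below are just stand-ins
for your own data:

```r
## Sketch: Poisson vs. negative binomial via a likelihood-ratio test.
## quine (school absenteeism counts) from MASS is a stand-in dataset.
library(MASS)

m_pois <- glm(Days ~ Sex + Age, family = poisson, data = quine)
m_nb   <- glm.nb(Days ~ Sex + Age, data = quine)

## LR statistic; the Poisson is the boundary case theta = Inf of the NB,
## so the chi-squared(1) p-value can be halved as described above.
lr <- as.numeric(2 * (logLik(m_nb) - logLik(m_pois)))
p  <- pchisq(lr, df = 1, lower.tail = FALSE) / 2
```

dispersiontest() from "AER" and lrtest() from "lmtest" do this more
conveniently, but are not shown here since those packages may not be
installed.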
> 2) I read that if variance is greater than mean we have overdispersion,
> and a potential way to deal with this is including more independent
> variables, or family=quasipoisson. Does this distribution have any other
> requirements or assumptions? What test do I use to see whether 1) or 2)
> fits better - simply anova(m1, m2)?
quasipoisson yields the same parameter estimates as the poisson, only the
inference is adjusted appropriately.
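A minimal simulated illustration of this point: the two fits share
identical coefficients, and the quasipoisson standard errors are all
scaled by one common dispersion factor.

```r
## Simulated sketch: quasipoisson = Poisson point estimates,
## standard errors rescaled by sqrt(estimated dispersion).
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.3 * x))

m_pois  <- glm(y ~ x, family = poisson)
m_quasi <- glm(y ~ x, family = quasipoisson)

## identical coefficients; SEs differ by a common factor:
se_ratio <- coef(summary(m_quasi))[, "Std. Error"] /
            coef(summary(m_pois))[, "Std. Error"]
```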
> 3) I also read that negativebinomial distribution can be used when
> overdispersion appears. How do I do this in R?
glm.nb() in "MASS" is one of the standard options.
> What is the difference to quasipoisson?
The NB is a likelihood-based model while the quasipoisson is not
associated with a likelihood (but has the same conditional mean equation).
> 4) Zeroinflated Poisson Regression: I read that using the vuong test
> checks what models fits better.
>> vuong(model.poisson, model.zero.poisson)
> Is that correct?
It's one of the possibilities.
> 5) ats.ucla.edu has a section about zeroinflated Poisson Regressions, and
> test the zeroinflated model (a) against the standard poisson model (b):
>> m.a <- zeroinfl(count ~ child + camper | persons, data = zinb)
>> m.b <- glm(count ~ child + camper, family = poisson, data = zinb)
>> vuong(m.a, m.b)
> I don't understand what the "| persons" part of the first model does, and
> why you can compare these models at all. I had expected the regression to
> be the same and just use a different family.
I recommend you read the associated documentation. See
vignette("countreg", package = "pscl")
For glm.nb() I recommend its accompanying documentation, namely the MASS
book.
hth,
Z


Thank you for the detailed answer, that was really helpful.
I did some extensive reading and calculating in the last hours since your
reply, and have a few (hopefully much more informed) follow-up questions.
1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC values
are listed for the negative binomial (NB), zero-inflated (ZI), ZI-NB,
hurdle NB, and (standard) Poisson models. And although I found a way to
determine LLH and DF for all models, BIC & AIC values are not displayed by
default, not even using the code given in the vignette. How do I calculate
these? (AIC is given by default only in 2 models, BIC in none.)
2) For the zero-inflated models, the first block of count model
coefficients is only in the output in order to compare the changes,
correct? That is, in my results in a paper I would only report the second
block of (zero-inflation) model coefficients? Or do I misunderstand
something? I am confused because in their large table comparing
coefficients, they primarily compare the first block of coefficients (p. 18):
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+   "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
3)
> There are various formal tests for this, e.g., dispersiontest() in package
> "AER".
>
I have to run 9 models - I am testing the influence of several predictors
on different individual symptoms of a mental disorder, as "counted" in the
last week (0 = never in the last week, to 3 = on all days within the last
week).
So I'm regressing the same predictors onto 9 different symptoms in 9
models.
dispersiontest() for these 9 models results in 3-4 overdispersed models
(depending on whether testing one- or two-sided at the p = .05 level), 2
underdispersed models, and 4 non-significant models. The by far largest
dispersion value is 2.1, in a model that is not overdispersed according to
the test, but that's the symptom with 80% zeros; maybe that plays a role here.
Does NB also make sense for underdispersed models?
> However, overdispersion can already matter before this is detected by a
> significance test. Hence, if in doubt, I would simply use an NB model and
> you're on the safe side. And if the NB's estimated theta parameter turns
> out to be extremely large (say beyond 20 or 30), then you can still switch
> back to Poisson if you want.
Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
Could you recommend what to do here?
Thanks
T


Hi Eiko,
This is not a direct response to your question, but I thought you
might find these pages helpful:
On negative binomial:
http://www.ats.ucla.edu/stat/R/dae/nbreg.htm
and zero-inflated Poisson:
http://www.ats.ucla.edu/stat/R/dae/zipoisson.htm
In general this page lists a variety of different models with examples:
http://www.ats.ucla.edu/stat/dae/
I wrote the code for most of those pages. Some of your questions seem a
little more specific than would be addressed on those general
introductory pages, but if there are particular things you would like
to see done, I am open to hearing about it.
One thing I do want to mention is that for publication or presentation
purposes, graphs of predicted counts can be quite nice for readers. I
tried to show how to do that in all of those pages, and you can even
use bootstrapping (which I gave examples for, at least in the zero-
inflated models) to get robust confidence intervals.
Cheers,
Josh
On Sun, Oct 14, 2012 at 3:14 PM, Eiko Fried < [hidden email]> wrote:

Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


Just a side note for your 4th question:
for a small sample, the Clarke test instead of the Vuong test might be more
appropriate, and the calculation is so simple that even Excel can
handle it :)
On Sun, Oct 14, 2012 at 12:00 PM, Eiko Fried < [hidden email]> wrote:

==============================
WenSui Liu
Credit Risk Manager, 53 Bancorp
[hidden email]
5132954370


On Sun, 14 Oct 2012, Eiko Fried wrote:
> Thank you for the detailed answer, that was really helpful. I did some
> excessive reading and calculating in the last hours since your reply,
> and have a few (hopefully much more informed) follow up questions.
>
> 1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC
> values are listed for the negative binomial (NB), zero-inflated (ZI),
> ZI-NB, hurdle NB, and (standard) Poisson models. And although I found a
> way to determine LLH and DF for all models, BIC & AIC values are not
> displayed by default, not even using the code given in the vignette. How
> do I calculate these? (AIC is given by default only in 2 models, BIC
> in none.)
The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleanedup version of the replication code.
Most likelihoodbased models provide a logLik() method which extracts the
fitted loglikelihood along with the corresponding degrees of freedom.
Then the default AIC() method can compute the AIC and AIC(..., k = log(n))
computes the corresponding BIC. This is hinted at in Table 3 of the
vignette. If "n", the number of observations, can be extracted by nobs(),
then also BIC() works. This is not yet the case for zeroinfl/hurdle,
though.
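The AIC(fit, k = log(n)) trick can be sketched in base R, here with a
Poisson glm on the built-in warpbreaks counts as a stand-in (for
zeroinfl/hurdle fits the same call applies, with n supplied manually):

```r
## AIC/BIC from logLik(): works for any model with a logLik() method.
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

ll  <- logLik(fit)                    # log-likelihood, df as attribute
aic <- AIC(fit)                       # -2*logLik + 2*df
bic <- AIC(fit, k = log(nobs(fit)))   # BIC: penalty log(n) per parameter
```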
> 2) For the zeroinflated models, the first block of count model
> coefficients is only in the output in order to compare the changes,
> correct? That is, in my results in a paper I would only report the
> second block of (zeroinflation) model coefficients? Or do I
> misunderstand something?
No, no, yes. Hurdle and zeroinflation models have two linear predictors,
one for the zero hurdle/inflation and one for the truncated/uninflated
count component. Both are typically of interest for different aspects of
the data.
> I am
> confused because in their large table to compare coefficients, they
> primarily compare the first block of coefficients (p. 18)
> R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
> +   "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
> R> sapply(fm, function(x) coef(x)[1:8])
All models considered have a predictor for the mean of the count
component, hence this can be compared across all models.
>
> 3)
> There are various formal tests for this, e.g., dispersiontest()
> in package "AER".
>
>
> I have to run 9 models - I am testing the influence of several
> predictors on different individual symptoms of a mental disorder, as
> "counted" in the last week (0 = never in the last week, to 3 = on all
> days within the last week). So I'm regressing the same predictors onto
> 9 different symptoms in 9 models.
That's not really a count response. I guess an ordered response model (see
e.g. polr() in MASS or package ordinal) would probably be more
appropriate.
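An ordered-response fit with polr() from MASS could be sketched as
follows, on simulated data mimicking a 0-3 symptom score (all variable
names here are placeholders):

```r
## Ordered logit (proportional odds) sketch for a 0-3 symptom score.
library(MASS)
set.seed(1)
x <- rnorm(300)                                  # placeholder predictor
latent  <- 0.8 * x + rlogis(300)                 # latent severity
symptom <- cut(latent, breaks = c(-Inf, 0, 1, 2, Inf),
               labels = 0:3, ordered_result = TRUE)

m_ord <- polr(symptom ~ x, Hess = TRUE)          # one slope, 3 cutpoints
```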
> dispersiontest() for these 9 models results in 3-4 overdispersed models
> (depending on whether testing one- or two-sided at the p = .05 level), 2
> underdispersed models, and 4 non-significant models. The by far largest
> dispersion value is 2.1, in a model that is not overdispersed according
> to the test, but that's the symptom with 80% zeros; maybe that plays a
> role here.
>
> Does NB also make sense for underdispersed models?
The variance function of the NB2 model is mu + 1/theta * mu^2. Hence for
finite theta, the variance is always larger than the mean.
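A one-line numeric check of that variance function:

```r
## NB2 variance mu + mu^2/theta always exceeds the mean for finite theta,
## so the NB model cannot accommodate underdispersion.
mu     <- seq(0.1, 10, by = 0.1)
theta  <- 2
nb_var <- mu + mu^2 / theta
all(nb_var > mu)   # TRUE for any positive mu and finite positive theta
```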
> However, overdispersion can already matter before this is
> detected by a significance test. Hence, if in doubt, I would
> simply use an NB model and you're on the safe side. And if the
> NB's estimated theta parameter turns out to be extremely large
> (say beyond 20 or 30), then you can still switch back to Poisson
> if you want.
>
> Out of the 9 models, the 3 overdispersed NB models "worked" with theta
> estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
> appropriate here.
>
> 4 other models (including the 2 underdispersed ones) gave warnings (theta
> iteration / alternation limit reached), and although the other values look
> ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
> Could you recommend what to do here?
As I said before: A theta around 20 or 30 is already so large that it is
virtually identical to a Poisson fit (theta = Inf). These values clearly
indicate that theta is not finite.
However, this almost certainly stems from your response which is not
really count data. As I said above: An ordered response model will
probably work better here. If you have mainly variation between 0 vs.
larger but not much among the levels 1/2/3, a binary response (0 vs.
larger) may be best.
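That binary collapse could look like this on simulated data (variable
names are placeholders):

```r
## Collapse a 0-3 score to "any symptom vs. none" and fit a binary logit.
set.seed(1)
x     <- rnorm(300)                              # placeholder predictor
score <- rbinom(300, size = 3, prob = plogis(-1 + 0.7 * x))

m_bin <- glm(I(score > 0) ~ x, family = binomial)
```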
hth,
Z
> Thanks
> T
>

