|
|
Hello,
I have a question about modelling via glm. I have a dataset (see dput)
that looks like as if it where poisson distributed (actually I would
appreciate that) but it isnt because mean unequals var.
> mean (x)
[1] 901.7827
> var (x)
[1] 132439.3
Anyway, I tried to model it via poisson and quasipoisson. Actually, just to
get an impression how glm works. But I dont know how to interprete the
data. Of course this is the case because my knowledge concerning logistic
regressions is rather limited. Hoping there is somebody with mercy I would
like to understand which parameters are important, e.g. which paramter
might give me a hint that a poisson model is a bad idea. For hints
concerning some tutorials about reading glm-output I would appreciate as
well.
Thanks
Wim
> skn300.glmp <- glm (freq~n, data=skn300.tab, family=poisson)
> summary (skn300.glmp)
Call:
glm(formula = freq ~ n, family = poisson, data = skn300.tab)
Deviance Residuals:
Min 1Q Median 3Q Max
-51.332 -9.383 -6.599 -3.959 55.111
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.2374375 0.0093285 775.8 <2e-16 ***
n -0.0539424 0.0003699 -145.8 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 71731 on 96 degrees of freedom
Residual deviance: 37383 on 95 degrees of freedom
AIC: 37800
Number of Fisher Scoring iterations: 6
>
> skn300.glmq <- glm (freq~n, data=skn300.tab, family=quasipoisson)
> summary (skn300.glmq)
Call:
glm(formula = freq ~ n, family = quasipoisson, data = skn300.tab)
Deviance Residuals:
Min 1Q Median 3Q Max
-51.332 -9.383 -6.599 -3.959 55.111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.237438 0.186381 38.831 < 2e-16 ***
n -0.053942 0.007391 -7.298 8.8e-11 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for quasipoisson family taken to be 399.1874)
Null deviance: 71731 on 96 degrees of freedom
Residual deviance: 37383 on 95 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
> dput (skn300.tab)
structure(list(n = 1:97, freq = c(0L, 0L, 0L, 0L, 1L, 7L, 40L,
100L, 276L, 543L, 952L, 1414L, 1853L, 2199L, 2435L, 2270L, 2042L,
1679L, 1386L, 1108L, 922L, 792L, 642L, 597L, 453L, 424L, 370L,
297L, 278L, 218L, 208L, 172L, 174L, 149L, 124L, 98L, 98L, 67L,
78L, 67L, 46L, 34L, 31L, 42L, 34L, 21L, 28L, 18L, 18L, 18L, 10L,
19L, 6L, 9L, 10L, 6L, 6L, 5L, 3L, 9L, 4L, 3L, 4L, 5L, 2L, 6L,
4L, 2L, 2L, 3L, 3L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 2L, 1L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 1L),
kum = c(0L, 0L, 0L, 0L, 1L, 8L, 48L, 148L, 424L, 967L, 1919L,
3333L, 5186L, 7385L, 9820L, 12090L, 14132L, 15811L, 17197L,
18305L, 19227L, 20019L, 20661L, 21258L, 21711L, 22135L, 22505L,
22802L, 23080L, 23298L, 23506L, 23678L, 23852L, 24001L, 24125L,
24223L, 24321L, 24388L, 24466L, 24533L, 24579L, 24613L, 24644L,
24686L, 24720L, 24741L, 24769L, 24787L, 24805L, 24823L, 24833L,
24852L, 24858L, 24867L, 24877L, 24883L, 24889L, 24894L, 24897L,
24906L, 24910L, 24913L, 24917L, 24922L, 24924L, 24930L, 24934L,
24936L, 24938L, 24941L, 24944L, 24944L, 24944L, 24944L, 24944L,
24946L, 24947L, 24947L, 24947L, 24947L, 24947L, 24947L, 24948L,
24948L, 24948L, 24949L, 24951L, 24952L, 24952L, 24952L, 24952L,
24952L, 24954L, 24954L, 24954L, 24954L, 24955L)), .Names = c("n",
"freq", "kum"), row.names = c(NA, -97L), class = "data.frame")
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
|
|
I am sure, that this is not a pure Poisson! Huge overdispersion!
You get inflated confidence intervals!
(although, the point estimates of the regression coefficients stay the same)
Try to look for the causes of overdispersion! It may be geteroscedastisity?
What is the nature of the response, is it the positive integers?
Perhaps in your model still missing something important predictors?
Or just you can try the Gamma or log-normal distr.
Ivan, IPAE UB RAS
|
|
On Thu, Jan 31, 2013 at 2:13 PM, Wim Kreinen < [hidden email]> wrote:
> Hello,
>
> I have a question about modelling via glm.
I think you are way off track. Either the data, glm, or both, are not what
you think they are.
> I have a dataset
skn300.tab <- structure(list(n = 1:97, freq = c(0L, 0L, 0L, 0L, 1L, 7L, 40L,
100L, 276L, 543L, 952L, 1414L, 1853L, 2199L, 2435L, 2270L, 2042L,
1679L, 1386L, 1108L, 922L, 792L, 642L, 597L, 453L, 424L, 370L,
297L, 278L, 218L, 208L, 172L, 174L, 149L, 124L, 98L, 98L, 67L,
78L, 67L, 46L, 34L, 31L, 42L, 34L, 21L, 28L, 18L, 18L, 18L, 10L,
19L, 6L, 9L, 10L, 6L, 6L, 5L, 3L, 9L, 4L, 3L, 4L, 5L, 2L, 6L,
4L, 2L, 2L, 3L, 3L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 2L, 1L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 1L),
kum = c(0L, 0L, 0L, 0L, 1L, 8L, 48L, 148L, 424L, 967L, 1919L,
3333L, 5186L, 7385L, 9820L, 12090L, 14132L, 15811L, 17197L,
18305L, 19227L, 20019L, 20661L, 21258L, 21711L, 22135L, 22505L,
22802L, 23080L, 23298L, 23506L, 23678L, 23852L, 24001L, 24125L,
24223L, 24321L, 24388L, 24466L, 24533L, 24579L, 24613L, 24644L,
24686L, 24720L, 24741L, 24769L, 24787L, 24805L, 24823L, 24833L,
24852L, 24858L, 24867L, 24877L, 24883L, 24889L, 24894L, 24897L,
24906L, 24910L, 24913L, 24917L, 24922L, 24924L, 24930L, 24934L,
24936L, 24938L, 24941L, 24944L, 24944L, 24944L, 24944L, 24944L,
24946L, 24947L, 24947L, 24947L, 24947L, 24947L, 24947L, 24948L,
24948L, 24948L, 24949L, 24951L, 24952L, 24952L, 24952L, 24952L,
24952L, 24954L, 24954L, 24954L, 24954L, 24955L)), .Names = c("n",
"freq", "kum"), row.names = c(NA, -97L), class = "data.frame")
> that looks like as if it where poisson distributed (actually I would
> appreciate that) but it isnt
plot(skn300.tab)
My guess, we are looking at the pdf and cdf (maybe even of a Poisson
process), but not at any "data" that lends itself to a (generalized) linear
model. Consult a statistician, post on stackexchange, read about
regression, or better define your actual R problem here, demonstrating this
is not homework - see the posting guide.
Cheers
> because mean unequals var.
>
>
> > mean (x)
> [1] 901.7827
> > var (x)
> [1] 132439.3
>
>
> Anyway, I tried to model it via poisson and quasipoisson. Actually, just to
> get an impression how glm works. But I dont know how to interprete the
> data. Of course this is the case because my knowledge concerning logistic
> regressions is rather limited. Hoping there is somebody with mercy I would
> like to understand which parameters are important, e.g. which paramter
> might give me a hint that a poisson model is a bad idea. For hints
> concerning some tutorials about reading glm-output I would appreciate as
> well.
>
> Thanks
> Wim
>
>
> > skn300.glmp <- glm (freq~n, data=skn300.tab, family=poisson)
> > summary (skn300.glmp)
>
> Call:
> glm(formula = freq ~ n, family = poisson, data = skn300.tab)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -51.332 -9.383 -6.599 -3.959 55.111
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 7.2374375 0.0093285 775.8 <2e-16 ***
> n -0.0539424 0.0003699 -145.8 <2e-16 ***
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> (Dispersion parameter for poisson family taken to be 1)
>
> Null deviance: 71731 on 96 degrees of freedom
> Residual deviance: 37383 on 95 degrees of freedom
> AIC: 37800
>
> Number of Fisher Scoring iterations: 6
>
> >
> > skn300.glmq <- glm (freq~n, data=skn300.tab, family=quasipoisson)
> > summary (skn300.glmq)
>
> Call:
> glm(formula = freq ~ n, family = quasipoisson, data = skn300.tab)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -51.332 -9.383 -6.599 -3.959 55.111
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 7.237438 0.186381 38.831 < 2e-16 ***
> n -0.053942 0.007391 -7.298 8.8e-11 ***
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> (Dispersion parameter for quasipoisson family taken to be 399.1874)
>
> Null deviance: 71731 on 96 degrees of freedom
> Residual deviance: 37383 on 95 degrees of freedom
> AIC: NA
>
> Number of Fisher Scoring iterations: 6
>
>
> > dput (skn300.tab)
> structure(list(n = 1:97, freq = c(0L, 0L, 0L, 0L, 1L, 7L, 40L,
> 100L, 276L, 543L, 952L, 1414L, 1853L, 2199L, 2435L, 2270L, 2042L,
> 1679L, 1386L, 1108L, 922L, 792L, 642L, 597L, 453L, 424L, 370L,
> 297L, 278L, 218L, 208L, 172L, 174L, 149L, 124L, 98L, 98L, 67L,
> 78L, 67L, 46L, 34L, 31L, 42L, 34L, 21L, 28L, 18L, 18L, 18L, 10L,
> 19L, 6L, 9L, 10L, 6L, 6L, 5L, 3L, 9L, 4L, 3L, 4L, 5L, 2L, 6L,
> 4L, 2L, 2L, 3L, 3L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 0L, 0L,
> 1L, 0L, 0L, 1L, 2L, 1L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 1L),
> kum = c(0L, 0L, 0L, 0L, 1L, 8L, 48L, 148L, 424L, 967L, 1919L,
> 3333L, 5186L, 7385L, 9820L, 12090L, 14132L, 15811L, 17197L,
> 18305L, 19227L, 20019L, 20661L, 21258L, 21711L, 22135L, 22505L,
> 22802L, 23080L, 23298L, 23506L, 23678L, 23852L, 24001L, 24125L,
> 24223L, 24321L, 24388L, 24466L, 24533L, 24579L, 24613L, 24644L,
> 24686L, 24720L, 24741L, 24769L, 24787L, 24805L, 24823L, 24833L,
> 24852L, 24858L, 24867L, 24877L, 24883L, 24889L, 24894L, 24897L,
> 24906L, 24910L, 24913L, 24917L, 24922L, 24924L, 24930L, 24934L,
> 24936L, 24938L, 24941L, 24944L, 24944L, 24944L, 24944L, 24944L,
> 24946L, 24947L, 24947L, 24947L, 24947L, 24947L, 24947L, 24948L,
> 24948L, 24948L, 24949L, 24951L, 24952L, 24952L, 24952L, 24952L,
> 24952L, 24954L, 24954L, 24954L, 24954L, 24955L)), .Names = c("n",
> "freq", "kum"), row.names = c(NA, -97L), class = "data.frame")
>
> [[alternative HTML version deleted]]
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.
>
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
|
|