Saturated model in binomial glm

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Saturated model in binomial glm

Giovanni Petris

Hi all,

Could somebody be so kind to explain to me what is the saturated model
on which deviance and degrees of freedom are calculated when fitting a
binomial glm?

Everything makes sense if I fit the model using as response a vector of
proportions or a two-column matrix. But when the response is a factor
and counts are specified via the "weights" argument, I am kind of lost
as far as what is the implied saturated model.

Here is a simple example, based on the UCBAdmissions data.

> UCBAd <- as.data.frame(UCBAdmissions)
> UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
+                         weights = Freq, data = UCBAd)
> UCBAd$deviance
[1] 5187.488
> UCBAd$df.residual
[1] 17

I can see that the data frame UCBAd has 24 rows and using 1+1+5
parameters to fit the model leaves me with 17 degrees of freedom.

What is not clear to me is what is the saturated model?

Is it the model that fits a probability zero to each row corresponding
to failures and a probability one to each row corresponding to
successes? If this is so, it seems to me that looking at the deviance as
a goodness-of-fit statistic does not make much sense in this case. Am I
missing something?

Thank you in advance,
Giovanni


--

Giovanni Petris  <[hidden email]>
Associate Professor
Department of Mathematical Sciences
University of Arkansas - Fayetteville, AR 72701
Ph: (479) 575-6324, 575-8630 (fax)
http://definetti.uark.edu/~gpetris/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Saturated model in binomial glm

Bill.Venables
This is a very good question.  You have spotted something that not many people see and it is important.

The bland assertion that the "deviance can be used as a test of fit" can be seriously misleading.

For this data the response is clearly binary, "Admitted" (success) or "Rejected" (failure) and the other two factors are explanatory variables.  

Any binomial model can be fitted by expressing the data as a binary response.  In this case there are

> sum(UCBAd$Freq)
[1] 4526
>

4526 trials, corresponding to the individual applicants for admission.  We can expand the data frame right out to this level and fit the model with the data in that form, and in this case the weights will be the default, ie. all 1's.

We can also *group* the data into subsets which are homogeneous with respect to the explanatory variables.  

The most succinct grouping would be into 12 groups corresponding to the 2 x 6 distinct classes defined by the two explanatory variables.  In this case you would specify the response either as a two-column matrix of successes/failures, or as a proportion with the totals for each of the 12 cases as weights.

Another succince grouping is into 2 x 2 x 6 classes as you do in your example.  In this case your response is the factor and the weights are the frequencies.

For all three cases a) the estimates will be the same, and so the predictions will be identical, b) the deviance will also be the same, but c) the degrees of freedom attributed to the deviance will be different.

The reason for c) is, as you have intuited, the saturated model is different.  Literally, the saturated model is a model with one mean parameter for each value taken as an observation when the model is fitted.  So the saturated model is *not* invariant with respect to grouping.

Let's look at two of these cases computationally:


> UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame
> dim(UCB_Expanded)
[1] 4526    3

>  ### now fit your model

> m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq)

>  ### and the same model using the binary data form

> m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded)

>  ### as predicted, the coefficients are identical (up to round off)

> coef(m1)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574
       DeptF
  3.30648006
> coef(m2)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574
       DeptF
  3.30648006

>   ### and so are the deviances, perhaps surprisingly:

> deviance(m1)
[1] 5187.488
> deviance(m2)
[1] 5187.488

>   ### but the degrees of freedom attributed to the deviance are different!

> m1$df.resid
[1] 17
> m2$df.resid
[1] 4519
>
 
If you were to fit the model in the most succinct form, with 12 relative frequencies, then you would get the same deviance again, but the degrees of freedom would be only 5.

So you need to be very careful in taking the deviance, even in binomial models, as a test of fit.  The way the data are grouped is relevant.

If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, then

The estimated coefficients, and their standard errors, vcov matrix,
The deviance, and so
*Differences* in deviances and
*Differences* in degrees of freedom

will be the same however the data are grouped, and so the usual tests and CI processes go ahead fine.

But the deviance itself can be misleading as a test of fit, since the outer hypothesis, the saturated model, is not fixed and depends on grouping.  For the ungrouped binary case it is *usually* misleading when taken simply at face value as chi-squared distributed under the null hypothesis.

I think there is a book or two around that discusses this issue, but probably not well enough.

Bill Venables.


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Giovanni Petris
Sent: Thursday, 17 February 2011 7:58 AM
To: [hidden email]
Subject: [R] Saturated model in binomial glm


Hi all,

Could somebody be so kind to explain to me what is the saturated model
on which deviance and degrees of freedom are calculated when fitting a
binomial glm?

Everything makes sense if I fit the model using as response a vector of
proportions or a two-column matrix. But when the response is a factor
and counts are specified via the "weights" argument, I am kind of lost
as far as what is the implied saturated model.

Here is a simple example, based on the UCBAdmissions data.

> UCBAd <- as.data.frame(UCBAdmissions)
> UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
+                         weights = Freq, data = UCBAd)
> UCBAd$deviance
[1] 5187.488
> UCBAd$df.residual
[1] 17

I can see that the data frame UCBAd has 24 rows and using 1+1+5
parameters to fit the model leaves me with 17 degrees of freedom.

What is not clear to me is what is the saturated model?

Is it the model that fits a probability zero to each row corresponding
to failures and a probability one to each row corresponding to
successes? If this is so, it seems to me that looking at the deviance as
a goodness-of-fit statistic does not make much sense in this case. Am I
missing something?

Thank you in advance,
Giovanni


--

Giovanni Petris  <[hidden email]>
Associate Professor
Department of Mathematical Sciences
University of Arkansas - Fayetteville, AR 72701
Ph: (479) 575-6324, 575-8630 (fax)
http://definetti.uark.edu/~gpetris/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Saturated model in binomial glm

Giovanni Petris
Dear Bill,

Thank you very much for your careful discussion of the issue.

It is not surprising that the deviance is the same whether you fit the
model using a factor response with weights or individual 0/1 responses.
I think this happens because the fitted probabilities in the saturated
models are zero or one (depending on the response being a failure or a
success) in both cases. The degrees of freedom are clearly different.

However, if you consider as response the two-column matrix of successes
and failures then you have a different deviance, with different degrees
of freedom:

> covariates <- subset(UCBAd, subset = Admit == "Admitted",
+                      select = -c(Admit, Freq))
> AdmitReject <- as.matrix(cbind(subset(UCBAd, subset = Admit == "Rejected",
+                                       select = Freq),
+                                subset(UCBAd, subset = Admit == "Admitted",
+                                       select = Freq)))
> UCBAd_matrix <- glm(AdmitReject ~ Gender + Dept, family = binomial,
+                     data = covariates) ## matrix as response
> UCBAd_matrix$deviance
[1] 20.20428
> UCBAd_matrix$df.residual
[1] 5
> coef(UCBAd_matrix)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574
       DeptF
  3.30648006

In this case the fitted probabilities in the saturated model are the
observed proportions. The fitted coefficients are clearly the same.

The deviance (and degrees of freedom) resulting from this way of fitting
the model _can_ be used as a goodness-of-fit statistic, based on
standard arguments, while the deviance (and degrees of freedom)
resulting from fitting the model as individual 0/1 responses _can't_.

What about deviance and degrees of freedom that I get using the "factor
with weights" approach? The number of parameters in the saturated model
does not increase with the number of observations, as it happens in the
individual 0/1 responses case, but the fitted probabilities in the
saturated model are fixed to zero and one. I guess it is not clear to me
that the standard likelihood ratio test asymptotic theory holds in this
case. Could anybody clarify?

Thank you in advance,
Giovanni


On Thu, 2011-02-17 at 10:50 +1100, [hidden email] wrote:

> This is a very good question.  You have spotted something that not
> many people see and it is important.
>
> The bland assertion that the "deviance can be used as a test of fit"
> can be seriously misleading.
>
> For this data the response is clearly binary, "Admitted" (success) or
> "Rejected" (failure) and the other two factors are explanatory
> variables.  
>
> Any binomial model can be fitted by expressing the data as a binary
> response.  In this case there are
>
> > sum(UCBAd$Freq)
> [1] 4526
> >
>
> 4526 trials, corresponding to the individual applicants for admission.
> We can expand the data frame right out to this level and fit the model
> with the data in that form, and in this case the weights will be the
> default, ie. all 1's.
>
> We can also *group* the data into subsets which are homogeneous with
> respect to the explanatory variables.  
>
> The most succinct grouping would be into 12 groups corresponding to
> the 2 x 6 distinct classes defined by the two explanatory variables.
> In this case you would specify the response either as a two-column
> matrix of successes/failures, or as a proportion with the totals for
> each of the 12 cases as weights.
>
> Another succince grouping is into 2 x 2 x 6 classes as you do in your
> example.  In this case your response is the factor and the weights are
> the frequencies.
>
> For all three cases a) the estimates will be the same, and so the
> predictions will be identical, b) the deviance will also be the same,
> but c) the degrees of freedom attributed to the deviance will be
> different.
>
> The reason for c) is, as you have intuited, the saturated model is
> different.  Literally, the saturated model is a model with one mean
> parameter for each value taken as an observation when the model is
> fitted.  So the saturated model is *not* invariant with respect to
> grouping.
>
> Let's look at two of these cases computationally:
>
>
> > UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame
> > dim(UCB_Expanded)
> [1] 4526    3
>
> >  ### now fit your model
>
> > m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq)
>
> >  ### and the same model using the binary data form
>
> > m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded)
>
> >  ### as predicted, the coefficients are identical (up to round off)
>
> > coef(m1)
>  (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE
>  -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574
>        DeptF
>   3.30648006
> > coef(m2)
>  (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE
>  -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574
>        DeptF
>   3.30648006
>
> >   ### and so are the deviances, perhaps surprisingly:
>
> > deviance(m1)
> [1] 5187.488
> > deviance(m2)
> [1] 5187.488
>
> >   ### but the degrees of freedom attributed to the deviance are different!
>
> > m1$df.resid
> [1] 17
> > m2$df.resid
> [1] 4519
> >
>  
> If you were to fit the model in the most succinct form, with 12
> relative frequencies, then you would get the same deviance again, but
> the degrees of freedom would be only 5.
>
> So you need to be very careful in taking the deviance, even in
> binomial models, as a test of fit.  The way the data are grouped is
> relevant.
>
> If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, then
>
> The estimated coefficients, and their standard errors, vcov matrix,
> The deviance, and so
> *Differences* in deviances and
> *Differences* in degrees of freedom
>
> will be the same however the data are grouped, and so the usual tests
> and CI processes go ahead fine.
>
> But the deviance itself can be misleading as a test of fit, since the
> outer hypothesis, the saturated model, is not fixed and depends on
> grouping.  For the ungrouped binary case it is *usually* misleading
> when taken simply at face value as chi-squared distributed under the
> null hypothesis.
>
> I think there is a book or two around that discusses this issue, but
> probably not well enough.
>
> Bill Venables.
>
>
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Giovanni Petris
> Sent: Thursday, 17 February 2011 7:58 AM
> To: [hidden email]
> Subject: [R] Saturated model in binomial glm
>
>
> Hi all,
>
> Could somebody be so kind to explain to me what is the saturated model
> on which deviance and degrees of freedom are calculated when fitting a
> binomial glm?
>
> Everything makes sense if I fit the model using as response a vector of
> proportions or a two-column matrix. But when the response is a factor
> and counts are specified via the "weights" argument, I am kind of lost
> as far as what is the implied saturated model.
>
> Here is a simple example, based on the UCBAdmissions data.
>
> > UCBAd <- as.data.frame(UCBAdmissions)
> > UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
> +                         weights = Freq, data = UCBAd)
> > UCBAd$deviance
> [1] 5187.488
> > UCBAd$df.residual
> [1] 17
>
> I can see that the data frame UCBAd has 24 rows and using 1+1+5
> parameters to fit the model leaves me with 17 degrees of freedom.
>
> What is not clear to me is what is the saturated model?
>
> Is it the model that fits a probability zero to each row corresponding
> to failures and a probability one to each row corresponding to
> successes? If this is so, it seems to me that looking at the deviance as
> a goodness-of-fit statistic does not make much sense in this case. Am I
> missing something?
>
> Thank you in advance,
> Giovanni
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.