# Chi square value of anova(binomialglmnull,binomglmmod,test="Chisq")

## Chi square value of anova(binomialglmnull,binomglmmod,test="Chisq")

Hi all,

I have done a backward stepwise selection on a full binomial GLM where the response variable is gender. At the end of the selection I have found one model with only one explanatory variable (cohort, factor variable with 10 levels).

I want to test the significance of the variable "cohort" that, I believe, is the same as the significance of this selected model:

    > anova(mod4,update(mod4,~.-cohort),test="Chisq")
    Analysis of Deviance Table

    Model 1: site ~ cohort
    Model 2: site ~ 1
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1       993     1283.7
    2      1002     1368.2 -9  -84.554 2.002e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

My question is: When I report this result, I would say "cohorts were unevenly distributed between sites (Chi2=84.5, df=9, p < 0.001)", is that correct? Is the Chi2 value the difference of deviance between model with cohort effect and null model?

## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 On Jun 4, 2012, at 7:00 AM, lincoln wrote:
> Hi all,
>
> I have done a backward stepwise selection on a full binomial GLM   
> where the
> response variable is gender.
> At the end of the selection I have found one model with only one   
> explanatory
> variable (cohort, factor variable with 10 levels).
>
> I want to test the significance of the variable "cohort" that, I   
> believe, is
> the same as the significance of this selected model:
>
>> anova(mod4,update(mod4,~.-cohort),test="Chisq")
> Analysis of Deviance Table
>
> Model 1: site ~ cohort
> Model 2: site ~ 1
>  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
> 1       993     1283.7
> 2      1002     1368.2 -9  -84.554 2.002e-14 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> My question is:
> When I report this result, I would say /"cohorts were unevenly   
> distributed
> between sites ( Chi2=84.5, df=9, p < 0.001)"/, is that correct? is   
> the Chi2
> value the difference of deviance between model with cohort effect   
> and null
> model?

I thought you said the response variable was gender? It seems to be   
'site' in these two models. Maybe you should give us some more   
information about how you constructed 'mod4'?

-- David Winsemius, MD
West Hartford, CT
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

So sorry,

My response variable is "site" (not "gender"!). The selection process was:

    > str(data)
    'data.frame': 1003 obs. of  5 variables:
     $ site  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
     $ sex   : Factor w/ 2 levels "0","1": NA NA NA NA 1 NA NA NA NA NA ...
     $ age   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
     $ cohort: Factor w/ 10 levels "1999","2000",..: 10 10 10 10 10 10 10 10 10 10 ...
     $ birth : Factor w/ 3 levels "5","6","7": 3 3 2 2 2 2 2 2 2 2 ...
    > datasex<-subset(data, sex !="NA")

Here below the structure of the analysis and only the anova.glm of the last, selected model, mod4:

    > mod1 <- glm(site ~ sex + birth + cohort + sex:birth, data=datasex, family = binomial)
    > summary(mod1)
    > anova(mod1,update(mod1,~.-sex:birth),test="Chisq")

    > mod2 <- glm(site ~ sex + birth + cohort, data=datasex, family = binomial)
    > summary(mod2)
    > anova(mod2,update(mod2,~.-sex),test="Chisq")

    > mod3 <- glm(site ~ birth + cohort, data=data, family = binomial)
    > summary(mod3)
    > anova(mod3,update(mod3,~.-birth),test="Chisq")

    > mod4 <- glm(site ~ cohort, data=data, family = binomial)
    > summary(mod4)
    > anova(mod4,update(mod4,~.-cohort),test="Chisq")
    Analysis of Deviance Table

    Model 1: site ~ cohort
    Model 2: site ~ 1
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1       993     1283.7
    2      1002     1368.2 -9  -84.554 2.002e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

My question: In this case, the Chi2 value would be the difference in deviance between models and d.f. the difference in d.f. (84.554 and 9)? In other words may I correctly assess: "cohorts were unevenly distributed between sites (Chi2=84.5, df=9, p < 0.001)"?

## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 On Jun 4, 2012, at 11:31 AM, lincoln wrote:

> So sorry,
>
> My response variable is "site" (not "gender"!).
> The selection process was:
>

If there is a natural probability interpretation to "site"==1 being a sort of event, (say perhaps a non-lymphatic site for the primary site of a lymphoma) then you can say that the log-odds for 'site' being 1 compared to the log-odds for being 0 are different among the cohorts. (Or equivalently that the odds ratios are "significantly" different.)

Worries: The fact that 'age' codes are 1/0 and 'birth' is 5, 6, or 7 makes me wonder what sort of measurements these are. I worry when variables usually considered as continuous get so severely discretized. The fact that this is data measured over time also raised further concerns about independence. Were controls observed in 1999 still subject to risk in 2000 and subsequent years? Were there substantial differences in the time to events? I also worry when words normally used as a location are interpreted as events and there is no context offered.

-- David.

>> str(data)
> 'data.frame': 1003 obs. of  5 variables:
>  $ site  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
>  $ sex   : Factor w/ 2 levels "0","1": NA NA NA NA 1 NA NA NA NA NA ...
>  $ age   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
>  $ cohort: Factor w/ 10 levels "1999","2000",..: 10 10 10 10 10 10 10 10 10 10 ...
>  $ birth : Factor w/ 3 levels "5","6","7": 3 3 2 2 2 2 2 2 2 2 ...
>> datasex<-subset(data, sex !="NA")
>
> Here below the structure of the analysis and only the anova.glm of the
> last, selected model, mod4:
>
>> mod1 <- glm(site ~ sex + birth + cohort + sex:birth, data=datasex, family = binomial)
>> summary(mod1)
>> anova(mod1,update(mod1,~.-sex:birth),test="Chisq")
>
>> mod2 <- glm(site ~ sex + birth + cohort, data=datasex, family = binomial)
>> summary(mod2)
>> anova(mod2,update(mod2,~.-sex),test="Chisq")
>
>> mod3 <- glm(site ~ birth + cohort, data=data, family = binomial)
>> summary(mod3)
>> anova(mod3,update(mod3,~.-birth),test="Chisq")
>
>> mod4 <- glm(site ~ cohort, data=data, family = binomial)
>> summary(mod4)
>> anova(mod4,update(mod4,~.-cohort),test="Chisq")
> Analysis of Deviance Table
>
> Model 1: site ~ cohort
> Model 2: site ~ 1
>  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
> 1       993     1283.7
> 2      1002     1368.2 -9  -84.554 2.002e-14 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> My question:
> In this case, the Chi2 value would be the difference in deviance between
> models and d.f. the difference in d.f. (84.554 and 9)?
> In other words may I correctly assess: "cohorts were unevenly distributed
> between sites (Chi2=84.5, df=9, p < 0.001)"?

David Winsemius, MD
West Hartford, CT
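
David's reading in terms of log-odds and odds ratios can be made concrete with a small sketch. The data below are simulated, not the poster's: the object names `sim` and `fit`, the sample size, and the effect sizes are all assumptions chosen only for illustration.

```r
# Simulated stand-in for the bird data (hypothetical; not the original data set)
set.seed(2)
n <- 1000
sim <- data.frame(cohort = factor(sample(1999:2008, n, replace = TRUE)))

# Assumed truth: the log-odds of being at site 1 rise with cohort
sim$site <- rbinom(n, 1, plogis(-0.5 + 0.15 * (as.integer(sim$cohort) - 1)))

fit <- glm(site ~ cohort, data = sim, family = binomial)

# Coefficients are log-odds differences relative to the reference cohort (1999);
# exponentiating gives odds ratios of site == 1 versus that reference cohort
log_odds_diff <- coef(fit)[-1]
odds_ratios   <- exp(log_odds_diff)
round(odds_ratios, 2)
```

Each of the nine odds ratios compares one cohort with the reference cohort, which is why the overall test of `cohort` has 9 degrees of freedom.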
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

Thank you for your comments and suggestions.

Site 0 and site 1 can be interpreted as events. In fact, these data come from simultaneous observations of individuals at two different sites (so they are independent observations: while an individual is observed at one site it can't be at the other).

Each individual is assigned to age "0" (first year of age) or "1" (all the rest). Even though this may seem a very strong (brutal?) pooling, from a biological point of view it makes sense, given that these two classes of individuals are quite homogeneous in their dispersal behavior within each age class (0 or 1). The goal of this analysis is just to characterize their dispersal behavior (which individuals stay home at site 0 and which ones disperse to site 1?).

About the "birth" issue, here I am more in doubt. "Birth" relates to the month of birth (5 = May, 6 = June, 7 = July). It seems to me too that it is quite a severe pooling (an individual born on 1st June is coded 6, the same as one born on 30th June, whereas individuals born on 30th May or 1st July are coded 5 or 7; it doesn't make much sense). Anyway, I didn't find a better way to measure this variable, as there is no real starting or ending point: individuals may be born from 1st May up to 31st July (I mean that in my data set there are no individuals born before or after these dates).

Any hint?
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 On Jun 5, 2012, at 4:52 AM, lincoln wrote:

> Thank you for your commentaries and suggestions.
>
> Site 0 and site 1 are interpretable like events.
> In fact these data come from a simultaneous observations of   
> individuals in
> two different sites (so they are independent observations: while one
> individual is observed in one site it can't be in another).
>
> Each individual is assigned to age "0" (first year of age), or   
> "1" (all the
> rest); even though it may seem a very strong (brutal?) pooling, from a
> biological point of view it makes sense given these two classes of
> individuals are quite homogeneous in their dispersal behavior within   
> each
> age class (0 or 1). The goal of this analysis is just to   
> characterize their
> dispersal behavior (which individuals stay home at site 0 and which   
> ones
> disperse to site 1?

This is making me think you really have multiple observations on the
same individuals (and that persons make transitions from one state to
another as a result of the passage of time). That needs a more complex
analysis than "simple" logistic regression. You might consider posting
a more complete description of the study on the SIG Mixed Effects
mailing list.

-- David.

>
> About the "birth" issue, here I am more in doubt. "Birth" relates to   
> the
> month of birth (5= May, 6= June, 7= July). It seems to me too it is   
> a quite
> severe pooling (one individual born 1st June is 5 as one individual   
> born
> 30th June but one individual born 30th May or 1st July is 4 or 6 - it
> doesn't make much sense). Anyway I didn't find a way to better   
> measure this
> variable as there is no a real starting and ending point, more or less
> individuals may born since 1st May up to 31th July (I mean in my   
> data set
> there are no individuals born before and after these dates).
>
> Any hint?

David Winsemius, MD
West Hartford, CT
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

David Winsemius wrote
> This is making me think you really have multiple observation on the
> same individuals (and that persons make transitions from one state to
> another as a result of the passage of time. That needs a more complex
> analysis than "simple" logistic regression. You might consider posting
> a more complete description of the study on the SIG Mixed Effects
> mailing list.
>
> -- David.

No, I haven't. Individuals are birds marked with an unique alphanumeric code that gives me information on their gender (sometimes I have this data sometime I haven't), and their birth date (as a consequence also the age). There are no multiple observations of the same individual.

Anyway, I believe I have not been answered to the main question: when using anova with test "Chisq" between two models, is the difference in deviance between the two models interpretable as the Chi Square value and the difference in df interpretable as the df of the Chi square test?

For instance, given:

    > anova(mod4,update(mod4,~.-cohort),test="Chisq")
    Analysis of Deviance Table

    Model 1: site ~ cohort
    Model 2: site ~ 1
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1       993     1283.7
    2      1002     1368.2 -9  -84.554 2.002e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Is 84.554 taken as the Chi square value, 9 as the df of the test and the p-value depending on these two values?
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 On Jun 6, 2012, at 10:59 , lincoln wrote:

>
> David Winsemius wrote
>>
>> This is making me think you really have multiple observation on the   
>> same individuals (and that persons make transitions from one state to   
>> another as a result of the passage of time. That needs a more complex   
>> analysis than "simple" logistic regression. You might consider posting   
>> a more complete description of the study on the SIG Mixed Effects   
>> mailing list.
>>
>> --
>> David.
>>
>
> No, I haven't. Individuals are birds marked with an unique alphanumeric code
> that gives me information on their gender (sometimes I have this data
> sometime I haven't), and their birth date (as a consequence also the age).
> There are no multiple observations of the same individual.
>
> Anyway, I believe I have not been answered to the main question: when using
> anova with test "Chisq" between two models, is the difference in deviance
> between the two models interpretable as the Chi Square value and the
> difference in df interpretable as the df of the Chi square test?
>
> For instance, given:
>
>> anova(mod4,update(mod4,~.-cohort),test="Chisq")
> Analysis of Deviance Table
>
> Model 1: site ~ cohort
> Model 2: site ~ 1
>  Resid. Df Resid. Dev Df Deviance P(>|Chi|)     
> 1       993     1283.7                           
> 2      1002     1368.2 -9  -84.554 2.002e-14 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Is 84.554 taken as the Chi square value, 9 as the df of the test and the
> p-value depending on these two values?

That's the general mechanism, yes. (Whether the chi-square distribution holds after variable selection is a more difficult issue. Frank Harrell might chime in and remind us that there are books on that subject.)

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]
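
The mechanism Peter confirms can be checked by hand from the numbers in the quoted table. The sketch below only re-does the arithmetic with base R's `pchisq()`; the variable names are illustrative, and it says nothing about whether the chi-square reference distribution is still appropriate after variable selection.

```r
# Values copied from the analysis of deviance table quoted in the thread
dev_null   <- 1368.2   # residual deviance of site ~ 1
dev_cohort <- 1283.7   # residual deviance of site ~ cohort
df_null    <- 1002     # residual df of site ~ 1
df_cohort  <- 993      # residual df of site ~ cohort

chi2 <- 84.554                  # 'Deviance' column, sign dropped
df   <- df_null - df_cohort     # difference in residual df

# The rounded residual deviances agree with the reported statistic
dev_null - dev_cohort

# Upper-tail probability of a chi-square variable with 'df' degrees of freedom
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
p_value
```

Up to the rounding of the printed deviance, `p_value` should agree with the 2.002e-14 shown in the table, which is why reporting "Chi2 = 84.5, df = 9" describes the same test.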
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 On Jun 6, 2012, at 9:36 AM, peter dalgaard wrote:

>
> On Jun 6, 2012, at 10:59 , lincoln wrote:
>
>>
>> David Winsemius wrote
>>>
>>> This is making me think you really have multiple observation on the   
>>> same individuals (and that persons make transitions from one state to   
>>> another as a result of the passage of time. That needs a more complex   
>>> analysis than "simple" logistic regression. You might consider posting   
>>> a more complete description of the study on the SIG Mixed Effects   
>>> mailing list.
>>>
>>> --
>>> David.
>>>
>>
>> No, I haven't. Individuals are birds marked with an unique alphanumeric code
>> that gives me information on their gender (sometimes I have this data
>> sometime I haven't), and their birth date (as a consequence also the age).
>> There are no multiple observations of the same individual.
>>
>> Anyway, I believe I have not been answered to the main question: when using
>> anova with test "Chisq" between two models, is the difference in deviance
>> between the two models interpretable as the Chi Square value and the
>> difference in df interpretable as the df of the Chi square test?
>>
>> For instance, given:
>>
>>> anova(mod4,update(mod4,~.-cohort),test="Chisq")
>> Analysis of Deviance Table
>>
>> Model 1: site ~ cohort
>> Model 2: site ~ 1
>> Resid. Df Resid. Dev Df Deviance P(>|Chi|)     
>> 1       993     1283.7                           
>> 2      1002     1368.2 -9  -84.554 2.002e-14 ***
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Is 84.554 taken as the Chi square value, 9 as the df of the test and the
>> p-value depending on these two values?
>
> That's the general mechanism, yes. (Whether the chi-square distribution holds after variable selection is a more difficult issue. Frank Harrell might chime in and remind us that there are books on that subject.)

Frank might be busy with useR preparations for next week...

Quoting from Frank's book "Regression Modeling Strategies", page 58, in the context of variable selection, stepwise methods and stopping rules:

"The residual $\chi^2$ can be tested for significance (if one is able to forget that because of variable selection this statistic does not have a $\chi^2$ distribution), or the stopping rule can be based on Akaike's information criterion (AIC), here residual $\chi^2$ - 2 x d.f. Of course, use of more insight from knowledge of the subject matter will generally improve the modeling process substantially. It must be remembered that no currently available stopping rule was developed for data driven variable selection. Stopping rules such as AIC or Mallows' $C_p$ are intended for comparing only two \emph{prespecified} models."

The entire chapter (4) discusses these issues in more detail and as Peter notes there are other books and papers that focus on the underlying issue of variable selection.

As Frank is oft-quoted as saying: "Variable selection is hazardous both to inference and to prediction. There is no free lunch; we are torturing data to confess its own sins."
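
The concern raised here can be explored with a rough simulation. The sketch below is an illustration under stated assumptions, not a procedure from the thread or from Frank's book: the response is pure noise, the predictors are independent normals, selection uses `step()` with its default AIC penalty, and the names `n_sim`, `n_obs`, `n_pred`, `p_sel`, and `rate` are hypothetical. The number of replicates is kept small so it runs quickly.

```r
set.seed(42)
n_sim  <- 100   # number of simulated data sets (small, for speed)
n_obs  <- 100   # observations per data set
n_pred <- 6     # candidate predictors, all unrelated to the response

p_sel <- replicate(n_sim, {
  d <- as.data.frame(matrix(rnorm(n_obs * n_pred), n_obs, n_pred))
  d$y <- rbinom(n_obs, 1, 0.5)             # response independent of predictors

  full <- glm(y ~ ., data = d, family = binomial)
  sel  <- step(full, direction = "backward", trace = 0)

  # If every predictor was dropped there is nothing left to test
  if (length(attr(terms(sel), "term.labels")) == 0) return(NA_real_)

  # Naive test of the selected model against the null model
  null <- update(sel, . ~ 1)
  tab  <- anova(null, sel, test = "Chisq")
  tab[2, ncol(tab)]                        # p-value is the last column
})

# Share of data sets in which the selected model looks "significant" at 5%,
# counting runs where nothing was selected as non-rejections
rate <- mean(!is.na(p_sel) & p_sel < 0.05)
rate
```

Because no predictor has a real effect, a valid test would reject in roughly 5% of data sets; one would typically expect the naive post-selection rate to be noticeably higher, which is the sense in which the statistic "does not have a chi-square distribution" after selection.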

Going back to Lincoln's prior post in the thread, presuming that there is sufficient data to use the original pre-specified model and also that the original full model itself was not derived from prior variable selection or univariate pre-screening:

  mod1 <- glm(site ~ sex + birth + cohort + sex:birth, data=datasex, family = binomial)

I would recommend reviewing the sequential likelihood ratio tests for that model, which add each term in turn starting from the null model:

  anova(mod1, test = "Chisq")

and determine whether or not 'cohort' was significant at some level there, rather than in the final reduced model.
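
As a minimal sketch of what that looks like, the code below fits the same formula to simulated data (the data frame `sim` and its effect sizes are assumptions, not the poster's data). It also shows `drop1()`, which is not mentioned in the thread but is a standard way to test each droppable term against the full model instead of sequentially.

```r
# Simulated stand-in with the same variable names and factor levels
set.seed(1)
n <- 500
sim <- data.frame(
  sex    = factor(sample(0:1, n, replace = TRUE)),
  birth  = factor(sample(5:7, n, replace = TRUE)),
  cohort = factor(sample(1999:2008, n, replace = TRUE))
)
# Assumed truth: only cohort affects the probability of site == 1
sim$site <- rbinom(n, 1, plogis(-1 + 0.2 * (as.integer(sim$cohort) - 1)))

full <- glm(site ~ sex + birth + cohort + sex:birth,
            data = sim, family = binomial)

# Sequential analysis of deviance: terms are added in formula order,
# so 'cohort' is tested after 'sex' and 'birth' are already in the model
seq_tab <- anova(full, test = "Chisq")
seq_tab

# Likelihood ratio tests for dropping single terms from the full model;
# main effects involved in the interaction are not offered for removal
drop_tab <- drop1(full, test = "Chisq")
drop_tab
```

In the sequential table the row for `cohort` carries 9 degrees of freedom, one per non-reference level, matching the df in the reduced-model comparison discussed earlier in the thread.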

You might also want to consider using some of the tools in Frank's rms package on CRAN to further evaluate/validate that model.

Regards,

Marc Schwartz
## Re: Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

 Thank you all, This was exactly the sort of help I hoped to get.