Strange degrees of freedom and SS from car::Anova with type II SS?

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Strange degrees of freedom and SS from car::Anova with type II SS?

Ramon Diaz-Uriarte-2

Dear All,

I do not understand the degrees of freedom returned by car::Anova under
some models. They seem to be too many (e.g., numerical variables getting
more than 1 df, factors getting more df than levels there are).

This is a reproducible example:

library(car)
data(Prestige)

## Make sure no issues from NAs in comparisons of SS below
prestige_nona <- na.omit(Prestige)

Anova(lm(prestige ~ women * type * income * education,
         data = prestige_nona))

## Notice how women, a numerical variable, has 3 df
## and type (factor with 3 levels) has 4 df.


## In contrast this seems to get the df right:
Anova(lm(prestige ~ women * type * income * education,
         data = prestige_nona), type = "III")

## And also gives the df I'd expect
anova(lm(prestige ~ women * type * income * education,
         data = prestige_nona))



## Type II SS for women in the above model I do not understand either.
m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)
m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)
## Does not match women SS
sum(residuals(m_1)^2) - sum(residuals(m_2)^2)

## See [1] below for examples where they match.


Looking at the code, I do not understand what the call from
linearHypothesis returns here (specially compared to other models), and the
problem seems to be in the return from ConjComp, possibly due to the the
vcov of the model? (But this is over my head).


I understand this is not a reasonable model to fit, and there are possibly
serious collinearity problems. But I was surprised by the dfs in the
absence of any warning of something gone wrong. So I think there is
something very basic I do not understand.



Thanks,


R.


[1] In contrast, in other models I see what I'd expect. For example:

## 1 df for women, 2 for type
Anova(lm(prestige ~ type * income * women, data = prestige_nona))
m_1 <- lm(prestige ~ type * income, data = prestige_nona)
m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)
## Type II SS for women
sum(residuals(m_1)^2) - sum(residuals(m_2)^2)

## 1 df for women, income, education
Anova(lm(prestige ~ education * income * women, data = prestige_nona))
m_1 <- lm(prestige ~ education * income, data = prestige_nona)
m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)
## Type II SS for women
sum(residuals(m_1)^2) - sum(residuals(m_2)^2)




--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: [hidden email]
       [hidden email]

http://ligarto.org/rdiaz

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Strange degrees of freedom and SS from car::Anova with type II SS?

Fox, John
Dear R.,

The problem you constructed is too ill-conditioned for the method that Anova() uses to compute type-II sums of squares and the associated degrees of freedom, with an immense condition number of the coefficient covariance matrix:

> library(car)
Loading required package: carData

> mod <- lm(prestige ~ women * type * income * education, data=Prestige)
> e <- eigen(vcov(mod))$values
> max(e)/min(e)
[1] 2.776205e+17

Simply centering the numerical predictors reduces the condition number by a factor of 10^3, which allows Anova() to work, even though the problem is still extremely ill-conditioned:

> Prestige.c <- within(Prestige, {
+   income <- income - mean(income)
+   education <- education - mean(education)
+   women <- women - mean(women)
+ })
> mod.c <- lm(prestige ~ women * type * income * education, data=Prestige.c)
> e.c <- eigen(vcov(mod.c))$values
> max(e)/min(e)
[1] 2.776205e+17

> Anova(mod.c)
Anova Table (Type II tests)

Response: prestige
                             Sum Sq Df F value    Pr(>F)    
women                        167.29  1  4.9516 0.0291142 *  
type                         744.30  2 11.0150 6.494e-05 ***
income                       789.00  1 23.3529 7.112e-06 ***
education                    699.54  1 20.7050 2.057e-05 ***
women:type                   140.32  2  2.0766 0.1326023    
women:income                  33.14  1  0.9807 0.3252424    
type:income                  653.40  2  9.6697 0.0001859 ***
women:education               30.36  1  0.8986 0.3462316    
type:education                 0.72  2  0.0107 0.9893462    
income:education               7.88  1  0.2331 0.6306681    
women:type:income            136.80  2  2.0245 0.1393087    
women:type:education         140.18  2  2.0745 0.1328633    
women:income:education       100.42  1  2.9722 0.0888832 .  
type:income:education         82.02  2  1.2138 0.3029069    
women:type:income:education    2.05  2  0.0303 0.9701334    
Residuals                   2500.16 74                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> mod.c.2 <- update(mod.c, . ~ . - women:type:income:education)
> sum(residuals(mod.c.2)^2) - sum(residuals(mod.c)^2)
[1] 2.049735

Beyond demonstrating that the algorithm that Anova() uses can be made to fail if the coefficient covariance matrix is sufficiently ill-conditioned problem, I’m not sure what the point of this is. I suppose that we could try to detect this condition, which falls in the small region between where lm() detects a singularity and the projections used by Anova() break down.

Best,
 John

-------------------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http::/socserv.mcmaster.ca/jfox

> On Dec 5, 2018, at 7:33 PM, Ramon Diaz-Uriarte <[hidden email]> wrote:
>
>
> Dear All,
>
> I do not understand the degrees of freedom returned by car::Anova under
> some models. They seem to be too many (e.g., numerical variables getting
> more than 1 df, factors getting more df than levels there are).
>
> This is a reproducible example:
>
> library(car)
> data(Prestige)
>
> ## Make sure no issues from NAs in comparisons of SS below
> prestige_nona <- na.omit(Prestige)
>
> Anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona))
>
> ## Notice how women, a numerical variable, has 3 df
> ## and type (factor with 3 levels) has 4 df.
>
>
> ## In contrast this seems to get the df right:
> Anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona), type = "III")
>
> ## And also gives the df I'd expect
> anova(lm(prestige ~ women * type * income * education,
>         data = prestige_nona))
>
>
>
> ## Type II SS for women in the above model I do not understand either.
> m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)
> ## Does not match women SS
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
> ## See [1] below for examples where they match.
>
>
> Looking at the code, I do not understand what the call from
> linearHypothesis returns here (specially compared to other models), and the
> problem seems to be in the return from ConjComp, possibly due to the the
> vcov of the model? (But this is over my head).
>
>
> I understand this is not a reasonable model to fit, and there are possibly
> serious collinearity problems. But I was surprised by the dfs in the
> absence of any warning of something gone wrong. So I think there is
> something very basic I do not understand.
>
>
>
> Thanks,
>
>
> R.
>
>
> [1] In contrast, in other models I see what I'd expect. For example:
>
> ## 1 df for women, 2 for type
> Anova(lm(prestige ~ type * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ type * income, data = prestige_nona)
> m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
> ## 1 df for women, income, education
> Anova(lm(prestige ~ education * income * women, data = prestige_nona))
> m_1 <- lm(prestige ~ education * income, data = prestige_nona)
> m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)
> ## Type II SS for women
> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>
>
>
>
> --
> Ramon Diaz-Uriarte
> Department of Biochemistry, Lab B-25
> Facultad de Medicina
> Universidad Autónoma de Madrid
> Arzobispo Morcillo, 4
> 28029 Madrid
> Spain
>
> Phone: +34-91-497-2412
>
> Email: [hidden email]
>       [hidden email]
>
> http://ligarto.org/rdiaz
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Strange degrees of freedom and SS from car::Anova with type II SS?

Ramon Diaz-Uriarte-2
Dear John,

Thank you very much for your reply.


On Thu, 06-December-2018, at 03:43:06, Fox, John <[hidden email]> wrote:
> Dear R.,
>
> The problem you constructed is too ill-conditioned for the method that
> Anova() uses to compute type-II sums of squares and the associated
> degrees of freedom, with an immense condition number of the coefficient
> covariance matrix:

As I said, I understand this is not a sensible model to fit (this was a
quick example that I put together on the fly during a class when talking
about models with high-order interactions involving factors and numerical
variables). I forgot to mention that I also checked the eigenvalues of vcov
as well as the VIFs and they all indicated likely trouble.


>
>> library(car)
> Loading required package: carData
>
>> mod <- lm(prestige ~ women * type * income * education, data=Prestige)
>> e <- eigen(vcov(mod))$values
>> max(e)/min(e)
> [1] 2.776205e+17
>
> Simply centering the numerical predictors reduces the condition number by
> a factor of 10^3, which allows Anova() to work, even though the problem
> is still extremely ill-conditioned:

Thanks for pointing that out. I should have done that!


>
>> Prestige.c <- within(Prestige, {
> +   income <- income - mean(income)
> +   education <- education - mean(education)
> +   women <- women - mean(women)
> + })
>> mod.c <- lm(prestige ~ women * type * income * education, data=Prestige.c)
>> e.c <- eigen(vcov(mod.c))$values
>> max(e)/min(e)
> [1] 2.776205e+17
>
>> Anova(mod.c)
> Anova Table (Type II tests)
>
> Response: prestige
>                              Sum Sq Df F value    Pr(>F)
> women                        167.29  1  4.9516 0.0291142 *
> type                         744.30  2 11.0150 6.494e-05 ***
> income                       789.00  1 23.3529 7.112e-06 ***
> education                    699.54  1 20.7050 2.057e-05 ***
> women:type                   140.32  2  2.0766 0.1326023
> women:income                  33.14  1  0.9807 0.3252424
> type:income                  653.40  2  9.6697 0.0001859 ***
> women:education               30.36  1  0.8986 0.3462316
> type:education                 0.72  2  0.0107 0.9893462
> income:education               7.88  1  0.2331 0.6306681
> women:type:income            136.80  2  2.0245 0.1393087
> women:type:education         140.18  2  2.0745 0.1328633
> women:income:education       100.42  1  2.9722 0.0888832 .
> type:income:education         82.02  2  1.2138 0.3029069
> women:type:income:education    2.05  2  0.0303 0.9701334
> Residuals                   2500.16 74
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>> mod.c.2 <- update(mod.c, . ~ . - women:type:income:education)
>> sum(residuals(mod.c.2)^2) - sum(residuals(mod.c)^2)
> [1] 2.049735
>
> Beyond demonstrating that the algorithm that Anova() uses can be made to
> fail if the coefficient covariance matrix is sufficiently ill-conditioned
> problem, I’m not sure what the point of this is. I suppose that we could
> try to detect this condition, which falls in the small region between
> where lm() detects a singularity and the projections used by Anova()
> break down.

The point of this: I was surprised that the only immediate hint I saw of
something gone wrong were the dfs (some numerical variables with > 1 df,
factors with more dfs than actual levels) with no additional warning.

But then, I am not sure if a simple check of dfs is something simple to add
to the code, or more generally whether trying to detect this condition is
worth it, as this might just be too borderline a case (as you say "the
small region between where lm() detects a singularity and the projections
used by Anova() break down.").


Again, thanks for clarifying my confusion.


Best,


Ramon

>
> Best,
>  John
>
> -------------------------------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: http::/socserv.mcmaster.ca/jfox
>
>> On Dec 5, 2018, at 7:33 PM, Ramon Diaz-Uriarte <[hidden email]> wrote:
>>
>>
>> Dear All,
>>
>> I do not understand the degrees of freedom returned by car::Anova under
>> some models. They seem to be too many (e.g., numerical variables getting
>> more than 1 df, factors getting more df than levels there are).
>>
>> This is a reproducible example:
>>>
>> library(car)
>> data(Prestige)
>>
>> ## Make sure no issues from NAs in comparisons of SS below
>> prestige_nona <- na.omit(Prestige)
>>
>> Anova(lm(prestige ~ women * type * income * education,
>>         data = prestige_nona))
>>
>> ## Notice how women, a numerical variable, has 3 df
>> ## and type (factor with 3 levels) has 4 df.
>>
>>
>> ## In contrast this seems to get the df right:
>> Anova(lm(prestige ~ women * type * income * education,
>>         data = prestige_nona), type = "III")
>>
>> ## And also gives the df I'd expect
>> anova(lm(prestige ~ women * type * income * education,
>>         data = prestige_nona))
>>
>>
>>
>> ## Type II SS for women in the above model I do not understand either.
>> m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)
>> m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)
>> ## Does not match women SS
>> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>>
>> ## See [1] below for examples where they match.
>>
>>
>> Looking at the code, I do not understand what the call from
>> linearHypothesis returns here (specially compared to other models), and the
>> problem seems to be in the return from ConjComp, possibly due to the the
>> vcov of the model? (But this is over my head).
>>
>>
>> I understand this is not a reasonable model to fit, and there are possibly
>> serious collinearity problems. But I was surprised by the dfs in the
>> absence of any warning of something gone wrong. So I think there is
>> something very basic I do not understand.
>>
>>
>>
>> Thanks,
>>
>>
>> R.
>>
>>
>> [1] In contrast, in other models I see what I'd expect. For example:
>>
>> ## 1 df for women, 2 for type
>> Anova(lm(prestige ~ type * income * women, data = prestige_nona))
>> m_1 <- lm(prestige ~ type * income, data = prestige_nona)
>> m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)
>> ## Type II SS for women
>> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>>
>> ## 1 df for women, income, education
>> Anova(lm(prestige ~ education * income * women, data = prestige_nona))
>> m_1 <- lm(prestige ~ education * income, data = prestige_nona)
>> m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)
>> ## Type II SS for women
>> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)
>>
>>
>>
>>
>> --
>> Ramon Diaz-Uriarte
>> Department of Biochemistry, Lab B-25
>> Facultad de Medicina
>> Universidad Autónoma de Madrid
>> Arzobispo Morcillo, 4
>> 28029 Madrid
>> Spain
>>
>> Phone: +34-91-497-2412
>>
>> Email: [hidden email]
>>       [hidden email]
>>
>> http://ligarto.org/rdiaz
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: [hidden email]
       [hidden email]

http://ligarto.org/rdiaz

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.