Dear R.,

The problem you constructed is too ill-conditioned for the method that Anova() uses to compute type-II sums of squares and the associated degrees of freedom, with an immense condition number of the coefficient covariance matrix:

> library(car)

Loading required package: carData

> mod <- lm(prestige ~ women * type * income * education, data=Prestige)

> e <- eigen(vcov(mod))$values

> max(e)/min(e)

[1] 2.776205e+17

Simply centering the numerical predictors reduces the condition number by a factor of 10^3, which allows Anova() to work, even though the problem is still extremely ill-conditioned:

> Prestige.c <- within(Prestige, {

+ income <- income - mean(income)

+ education <- education - mean(education)

+ women <- women - mean(women)

+ })

> mod.c <- lm(prestige ~ women * type * income * education, data=Prestige.c)

> e.c <- eigen(vcov(mod.c))$values

> max(e)/min(e)

[1] 2.776205e+17

> Anova(mod.c)

Anova Table (Type II tests)

Response: prestige

Sum Sq Df F value Pr(>F)

women 167.29 1 4.9516 0.0291142 *

type 744.30 2 11.0150 6.494e-05 ***

income 789.00 1 23.3529 7.112e-06 ***

education 699.54 1 20.7050 2.057e-05 ***

women:type 140.32 2 2.0766 0.1326023

women:income 33.14 1 0.9807 0.3252424

type:income 653.40 2 9.6697 0.0001859 ***

women:education 30.36 1 0.8986 0.3462316

type:education 0.72 2 0.0107 0.9893462

income:education 7.88 1 0.2331 0.6306681

women:type:income 136.80 2 2.0245 0.1393087

women:type:education 140.18 2 2.0745 0.1328633

women:income:education 100.42 1 2.9722 0.0888832 .

type:income:education 82.02 2 1.2138 0.3029069

women:type:income:education 2.05 2 0.0303 0.9701334

Residuals 2500.16 74

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> mod.c.2 <- update(mod.c, . ~ . - women:type:income:education)

> sum(residuals(mod.c.2)^2) - sum(residuals(mod.c)^2)

[1] 2.049735

Beyond demonstrating that the algorithm that Anova() uses can be made to fail if the coefficient covariance matrix is sufficiently ill-conditioned problem, I’m not sure what the point of this is. I suppose that we could try to detect this condition, which falls in the small region between where lm() detects a singularity and the projections used by Anova() break down.

Best,

John

-------------------------------------------------

John Fox, Professor Emeritus

McMaster University

Hamilton, Ontario, Canada

Web: http::/socserv.mcmaster.ca/jfox

> On Dec 5, 2018, at 7:33 PM, Ramon Diaz-Uriarte <

[hidden email]> wrote:

>

>

> Dear All,

>

> I do not understand the degrees of freedom returned by car::Anova under

> some models. They seem to be too many (e.g., numerical variables getting

> more than 1 df, factors getting more df than levels there are).

>

> This is a reproducible example:

>

> library(car)

> data(Prestige)

>

> ## Make sure no issues from NAs in comparisons of SS below

> prestige_nona <- na.omit(Prestige)

>

> Anova(lm(prestige ~ women * type * income * education,

> data = prestige_nona))

>

> ## Notice how women, a numerical variable, has 3 df

> ## and type (factor with 3 levels) has 4 df.

>

>

> ## In contrast this seems to get the df right:

> Anova(lm(prestige ~ women * type * income * education,

> data = prestige_nona), type = "III")

>

> ## And also gives the df I'd expect

> anova(lm(prestige ~ women * type * income * education,

> data = prestige_nona))

>

>

>

> ## Type II SS for women in the above model I do not understand either.

> m_1 <- lm(prestige ~ type * income * education, data = prestige_nona)

> m_2 <- lm(prestige ~ type * income * education + women, data = prestige_nona)

> ## Does not match women SS

> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)

>

> ## See [1] below for examples where they match.

>

>

> Looking at the code, I do not understand what the call from

> linearHypothesis returns here (specially compared to other models), and the

> problem seems to be in the return from ConjComp, possibly due to the the

> vcov of the model? (But this is over my head).

>

>

> I understand this is not a reasonable model to fit, and there are possibly

> serious collinearity problems. But I was surprised by the dfs in the

> absence of any warning of something gone wrong. So I think there is

> something very basic I do not understand.

>

>

>

> Thanks,

>

>

> R.

>

>

> [1] In contrast, in other models I see what I'd expect. For example:

>

> ## 1 df for women, 2 for type

> Anova(lm(prestige ~ type * income * women, data = prestige_nona))

> m_1 <- lm(prestige ~ type * income, data = prestige_nona)

> m_2 <- lm(prestige ~ type * income + women, data = prestige_nona)

> ## Type II SS for women

> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)

>

> ## 1 df for women, income, education

> Anova(lm(prestige ~ education * income * women, data = prestige_nona))

> m_1 <- lm(prestige ~ education * income, data = prestige_nona)

> m_2 <- lm(prestige ~ education * income + women, data = prestige_nona)

> ## Type II SS for women

> sum(residuals(m_1)^2) - sum(residuals(m_2)^2)

>

>

>

>

> --

> Ramon Diaz-Uriarte

> Department of Biochemistry, Lab B-25

> Facultad de Medicina

> Universidad Autónoma de Madrid

> Arzobispo Morcillo, 4

> 28029 Madrid

> Spain

>

> Phone: +34-91-497-2412

>

> Email:

[hidden email]
>

[hidden email]
>

>

http://ligarto.org/rdiaz>

> ______________________________________________

>

[hidden email] mailing list -- To UNSUBSCRIBE and more, see

>

https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.

______________________________________________

[hidden email] mailing list -- To UNSUBSCRIBE and more, see

https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide

http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.