Why does the order of terms in a formula translate into different models/ model matrices?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Why does the order of terms in a formula translate into different models/ model matrices?

Alexandra
Dear all,

I have encountered some strange things when creating lm objects in R:  model depends on the order of the terms specified in a formula.
Let us consider the following simple example:

> dat <- expand.grid(A = factor(c("a1", "a2")),
+                    B = factor(paste("b", 1:4, sep="")),
+                    rep = factor(1:2))
> set.seed(12345)
> dat$x <- 1:16 * .5 + runif(16)
> dat$Y <- rnorm(16)

> dat
    A  B rep        x          Y
1  a1 b1   1 1.220904 -0.2841597
2  a2 b1   1 1.875773 -0.9193220
3  a1 b2   1 2.260982 -0.1162478
4  a2 b2   1 2.886125  1.8173120
5  a1 b3   1 2.956481  0.3706279
6  a2 b3   1 3.166372  0.5202165
7  a1 b4   1 3.825095 -0.7505320
8  a2 b4   1 4.509224  0.8168998
9  a1 b1   2 5.227705 -0.8863575
10 a2 b1   2 5.989737 -0.3315776
11 a1 b2   2 5.534535  1.1207127
12 a2 b2   2 6.152373  0.2987237
13 a1 b3   2 7.235685  0.7796219
14 a2 b3   2 7.001137  1.4557851
15 a1 b4   2 7.891203 -0.6443284
16 a2 b4   2 8.462495 -1.5531374


> logLik(m0 <- lm(Y ~ A:B + x:A, dat))
'log Lik.' -13.22186 (df=11)

> logLik(m1 <- lm(Y ~ x:A + A:B, dat))
'log Lik.' -13.66822 (df=10)

To me it is a bit strange that m0 and m1 models appear to have different loglikelihood (only the order of the terms in a formula was changed!)

My guess is that the problem lies in model.matrix:

X1 <- model.matrix(~ x:A + A:B, dat)
X2 <- model.matrix(~ A:B + x:A, dat)


## number of columns:
ncol(X1) ## 9

> ncol(X2) ## 11

## rank of design matrices:
qr(X1)$rank ## 9

qr(X2)$rank ## 10

 
Will be very grateful if someone could help me here!

Thanks,

Alexandra



Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

bbolker
Alexandra <alku <at> imm.dtu.dk> writes:

>
> Dear all,
>
> I have encountered some strange things when creating lm objects in R:  model
> depends on the order of the terms specified in a formula.
> Let us consider the following simple example:
>
> > dat <- expand.grid(A = factor(c("a1", "a2")),
> +                    B = factor(paste("b", 1:4, sep="")),
> +                    rep = factor(1:2))
> > set.seed(12345)
> > dat$x <- 1:16 * .5 + runif(16)
> > dat$Y <- rnorm(16)
>
> > dat

  [snip]

>
> > logLik(m0 <- lm(Y ~ A:B + x:A, dat))
> 'log Lik.' -13.22186 (df=11)
>
> > logLik(m1 <- lm(Y ~ x:A + A:B, dat))
> 'log Lik.' -13.66822 (df=10)
>
> To me it is a bit strange that m0 and m1 models appear to have different
> loglikelihood (only the order of the terms in a formula was changed!)
>
> My guess is that the problem lies in model.matrix:
>
> X1 <- model.matrix(~ x:A + A:B, dat)
> X2 <- model.matrix(~ A:B + x:A, dat)
>
> ## number of columns:
> ncol(X1) ## 9
>
> > ncol(X2) ## 11
>
> ## rank of design matrices:
> qr(X1)$rank ## 9
>
> qr(X2)$rank ## 10

   I agree that this seems weird, and I haven't been able to figure
it out yet.  My best (not very well-informed) guess is that there is
something going on with automatic dropping of terms that appear to
be aliased?? and that this test is (perhaps unintentionally) order-
dependent. I don't see anything obvious in the R code
(model.matrix.default) that would be responsible, and I haven't
had time (and probably won't) to go through the underlying C code
(do_modelmatrix in src/main/model.c) to see what's going on.

  For what it's worth, the ultimate reference for how this stuff
is *supposed* to work is (I think) the "White Book" (Statistical
Models in S, Chambers and Hastie) -- I don't alas have a copy.
I don't see any further references to model.matrix or 'model matrix'
in the R manuals, other than a fairly short description in section
11 of the "Intro to R".

  In other words, I'm stumped too and I hope someone else can
provide a more informed answer.

  Ben Bolker

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

Mark Difford
On Jan 27, 2012; 6:29pm Ben Bolker wrote:

>  My best (not very well-informed) guess is that there is something going on with automatic dropping of terms
>  that appear to be aliased?? and that this test is (perhaps unintentionally) order-dependent.

Looks to me like Ben is close to the mark here. From ?alias: "Complete aliasing refers to effects in linear models that cannot be estimated independently of the terms which occur earlier in the model and so have their coefficients omitted from the fit."

> alias(m0, complete=T)
Model :
Y ~ A:B + x:A

Complete :
        (Intercept) Aa1:Bb1 Aa2:Bb1 Aa1:Bb2 Aa2:Bb2 Aa1:Bb3 Aa2:Bb3 Aa1:Bb4 Aa1:x Aa2:x
Aa2:Bb4  1          -1      -1      -1      -1      -1      -1      -1       0     0

> alias(m1, complete=T)
Model :
Y ~ x:A + A:B

However, if you fit "proper" (or statistically sensible models), then there is no problem reversing terms:

> logLik(m2 <- lm(Y ~ A*B + x*A, dat))
'log Lik.' -13.22186 (df=11)

> logLik(m3 <- lm(Y ~ x*A + A*B, dat))
'log Lik.' -13.22186 (df=11)

Regards, Mark Difford
Mark Difford (Ph.D.)
Research Associate
Botany Department
Nelson Mandela Metropolitan University
Port Elizabeth, South Africa
Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

cberry
In reply to this post by Alexandra
Alexandra <[hidden email]> writes:

> Dear all,
>
> I have encountered some strange things when creating lm objects in R:  model
> depends on the order of the terms specified in a formula.
> Let us consider the following simple example:
>
>> dat <- expand.grid(A = factor(c("a1", "a2")),
> +                    B = factor(paste("b", 1:4, sep="")),
> +                    rep = factor(1:2))
>> set.seed(12345)
>> dat$x <- 1:16 * .5 + runif(16)
>> dat$Y <- rnorm(16)
>
>> dat
>     A  B rep        x          Y
> 1  a1 b1   1 1.220904 -0.2841597
> 2  a2 b1   1 1.875773 -0.9193220
> 3  a1 b2   1 2.260982 -0.1162478
> 4  a2 b2   1 2.886125  1.8173120
> 5  a1 b3   1 2.956481  0.3706279
> 6  a2 b3   1 3.166372  0.5202165
> 7  a1 b4   1 3.825095 -0.7505320
> 8  a2 b4   1 4.509224  0.8168998
> 9  a1 b1   2 5.227705 -0.8863575
> 10 a2 b1   2 5.989737 -0.3315776
> 11 a1 b2   2 5.534535  1.1207127
> 12 a2 b2   2 6.152373  0.2987237
> 13 a1 b3   2 7.235685  0.7796219
> 14 a2 b3   2 7.001137  1.4557851
> 15 a1 b4   2 7.891203 -0.6443284
> 16 a2 b4   2 8.462495 -1.5531374
>
>
>> logLik(m0 <- lm(Y ~ A:B + x:A, dat))
> 'log Lik.' -13.22186 (df=11)
>
>> logLik(m1 <- lm(Y ~ x:A + A:B, dat))
> 'log Lik.' -13.66822 (df=10)
>
> To me it is a bit strange that m0 and m1 models appear to have different
> loglikelihood (only the order of the terms in a formula was changed!)
>
> My guess is that the problem lies in model.matrix:
>
> X1 <- model.matrix(~ x:A + A:B, dat)
> X2 <- model.matrix(~ A:B + x:A, dat)


Close, but not quite. The problem lies in terms()

Here are the attr(terms(...),"factors") matrices:

 > attributes(terms(Y ~ x:A + A:B,data=dat))$factors
  x:A A:B
Y   0   0
x   2   0
A   2   2
B   0   1
>   attributes(terms(Y ~ A:B + x:A ,data=dat))$factors
  A:B A:x
Y   0   0
A   2   2
B   2   0
x   0   1

As you see, the encoding of x and B are treated differently under the
two orderings.

See ?terms.object for what those codes mean.

Same deal for these seemingly equivalent formulae:


>   attributes(terms(Y ~ (x + A + B)^2-A,data=dat))$factors
  x B x:A x:B A:B
Y 0 0   0   0   0
x 1 0   2   1   0
A 0 0   1   0   1
B 0 1   0   1   1
>   attributes(terms(Y ~ (A + B + x)^2-A,data=dat))$factors
  B x A:B A:x B:x
Y 0 0   0   0   0
A 0 0   1   1   0
B 1 0   2   0   1
x 0 1   0   1   1
>


AFAICS, this is a bug.

HTH,

Chuck

>
>
> ## number of columns:
> ncol(X1) ## 9
>
>> ncol(X2) ## 11
>
> ## rank of design matrices:
> qr(X1)$rank ## 9
>
> qr(X2)$rank ## 10
>
>  
> Will be very grateful if someone could help me here!
>
> Thanks,
>
> Alexandra
>
>
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Why-does-the-order-of-terms-in-a-formula-translate-into-different-models-model-matrices-tp4333408p4333408.html
> Sent from the R help mailing list archive at Nabble.com.
>

--
Charles C. Berry                            Dept of Family/Preventive Medicine
cberry at ucsd edu    UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

bbolker
 <cberry <at> tajo.ucsd.edu> writes:

>
> Alexandra <alku <at> imm.dtu.dk> writes:
>
[snip]

 
 Close, but not quite. The problem lies in terms()
 
 Here are the attr(terms(...),"factors") matrices:
 
  > attributes(terms(Y ~ x:A + A:B,data=dat))$factors
   x:A A:B
 Y   0   0
 x   2   0
 A   2   2
 B   0   1
 >   attributes(terms(Y ~ A:B + x:A ,data=dat))$factors
   A:B A:x
 Y   0   0
 A   2   2
 B   2   0
 x   0   1
 
 As you see, the encoding of x and B are treated differently under the
 two orderings.
 
 See ?terms.object for what those codes mean.
 
 Same deal for these seemingly equivalent formulae:
 
 >   attributes(terms(Y ~ (x + A + B)^2-A,data=dat))$factors
   x B x:A x:B A:B
 Y 0 0   0   0   0
 x 1 0   2   1   0
 A 0 0   1   0   1
 B 0 1   0   1   1
 >   attributes(terms(Y ~ (A + B + x)^2-A,data=dat))$factors
   B x A:B A:x B:x
 Y 0 0   0   0   0
 A 0 0   1   1   0
 B 1 0   2   0   1
 x 0 1   0   1   1
 >

(quoting removed to make Gmane happy)
 
 AFAICS, this is a bug.
 

  I think so too, although I haven't got my head around it yet.

  Chuck, are you willing to post a summary of this to r-devel
for discussion ... and/or post a bug report?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

Peter Dalgaard-2

On Jan 29, 2012, at 02:42 , Ben Bolker wrote:

>>
>
> (quoting removed to make Gmane happy)
>
> AFAICS, this is a bug.
>
>
>  I think so too, although I haven't got my head around it yet.
>
>  Chuck, are you willing to post a summary of this to r-devel
> for discussion ... and/or post a bug report?

You have to be very specific about what the bug is, if any... I.e., which precisely are the rules that are broken by the current behavior?

Preferably also suggest a fix --- the corner cases of model.matrix and friends is some of the more impenetrable code in the R sources.

Notice that order dependent parametrization of terms is not a bug per se, nor is the automatic switch to dummy coding of factors. Consider these cases:

d <- cbind(expand.grid(a=c("a","b"),b=c("X","Y"),c=c("U","W")),x=1:8)
model.matrix(~ a:b + a:c, d)
model.matrix(~ a:c + a:b, d)
model.matrix(~ a:b + a:x, d)
model.matrix(~ a:x + a:b, d)

and notice that the logic applying to "x" is the same as that applying to "c".

The crux seems to be that the model ~a:c contains the model ~a whereas ~a:x does not, and hence the rationale for _not_ expanding a subsequent a:b term to dummies (namely that ~a is "already in" the model) fails.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Why does the order of terms in a formula translate into different models/ model matrices?

Peter Dalgaard-2
I know this is three months old, but I had put it on a mental TODO list to see whether it was something that could be fixed, and I finally got a little spare time for that sort of thing. It turns out that it can NOT be fixed (without fundamental design changes), so I thought I'd better record the conclusion for the archives.

The reason is pretty simple: It is terms.formula that decides which terms require contrast coding and which need dummy coding. It does this based on the formula structure _only_; i.e., it has no information on whether variables are factors or numerical. E.g., you can do

> terms(~ a:x + a:b)
~a:x + a:b
attr(,"variables")
list(a, x, b)
attr(,"factors")
  a:x a:b
a   2   2
x   2   0
b   0   1
attr(,"term.labels")
[1] "a:x" "a:b"
attr(,"order")
[1] 2 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
attr(,".Environment")
<environment: R_GlobalEnv>

in which there is no indication to terms() what kind of variable "x" is. So if you want different behavior if x is a factor than if it is continuous, you're out of luck...

Not quite sure the last three lines of my note below make sense, though.

-pd

On Jan 29, 2012, at 11:24 , peter dalgaard wrote:

>
> On Jan 29, 2012, at 02:42 , Ben Bolker wrote:
>>>
>>
>> (quoting removed to make Gmane happy)
>>
>> AFAICS, this is a bug.
>>
>>
>> I think so too, although I haven't got my head around it yet.
>>
>> Chuck, are you willing to post a summary of this to r-devel
>> for discussion ... and/or post a bug report?
>
> You have to be very specific about what the bug is, if any... I.e., which precisely are the rules that are broken by the current behavior?
>
> Preferably also suggest a fix --- the corner cases of model.matrix and friends is some of the more impenetrable code in the R sources.
>
> Notice that order dependent parametrization of terms is not a bug per se, nor is the automatic switch to dummy coding of factors. Consider these cases:
>
> d <- cbind(expand.grid(a=c("a","b"),b=c("X","Y"),c=c("U","W")),x=1:8)
> model.matrix(~ a:b + a:c, d)
> model.matrix(~ a:c + a:b, d)
> model.matrix(~ a:b + a:x, d)
> model.matrix(~ a:x + a:b, d)
>
> and notice that the logic applying to "x" is the same as that applying to "c".
>
> The crux seems to be that the model ~a:c contains the model ~a whereas ~a:x does not, and hence the rationale for _not_ expanding a subsequent a:b term to dummies (namely that ~a is "already in" the model) fails.
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: [hidden email]  Priv: [hidden email]
>
>
>
>
>
>
>
>

--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.