Order of formula terms in model.matrix

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Order of formula terms in model.matrix

Lars Bishop-2
I’d appreciate your help on understanding the following.

It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
 

set.seed(1)
x1 <- rnorm(100)
f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
trt <- sample(c(-1,1), 100, replace = TRUE)
df <- data.frame(x1=x1, f1=f1, trt=trt)

dim(model.matrix( ~ x1:trt + f1:trt, data = df))
[1] 100 4

dim(model.matrix(~ f1:trt + x1:trt, data = df))
[1] 100 5


Thanks,
Lars.
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Berry, Charles
On Sun, 17 Jan 2016, Lars Bishop wrote:

> I’d appreciate your help on understanding the following.
>

> It is not very clear to me from the model.matrix documentation, why
> simply changing the order of terms in the formula may change the number
> of resulting columns. Please note I’m purposely not including main
> effects in the model formula in this case.


IIRC, there are some heuristics involved harking back to the White Book. I
recall there have been discussions of whether and how this could be fixed
before on this list and or R-devel, but I cannot seem to lay my browser on
them right now.


>
>
> set.seed(1)
> x1 <- rnorm(100)
> f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
> trt <- sample(c(-1,1), 100, replace = TRUE)
> df <- data.frame(x1=x1, f1=f1, trt=trt)
>
> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
> [1] 100 4
>
> dim(model.matrix(~ f1:trt + x1:trt, data = df))
> [1] 100 5
>

By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.

If you use the latter form, the issue you raise goes away.

Note that `I(some.expr)' gives you the ability to force the behavior of
model.matrix to be exactly what you want by suitably crafting `some.expr',
heuristics notwithstanding.

HTH,

Chuck
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Lars Bishop-2
This is very helpful, thanks!

Lars.


> On Jan 17, 2016, at 1:34 PM, Charles C. Berry <[hidden email]> wrote:
>
> On Sun, 17 Jan 2016, Lars Bishop wrote:
>
>> I’d appreciate your help on understanding the following.
>
>> It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
>
>
> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or R-devel, but I cannot seem to lay my browser on them right now.
>
>
>>
>> set.seed(1)
>> x1 <- rnorm(100)
>> f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
>> trt <- sample(c(-1,1), 100, replace = TRUE)
>> df <- data.frame(x1=x1, f1=f1, trt=trt)
>>
>> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
>> [1] 100 4
>>
>> dim(model.matrix(~ f1:trt + x1:trt, data = df))
>> [1] 100 5
>>
>
> By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.
>
> If you use the latter form, the issue you raise goes away.
>
> Note that `I(some.expr)' gives you the ability to force the behavior of model.matrix to be exactly what you want by suitably crafting `some.expr', heuristics notwithstanding.
>
> HTH,
>
> Chuck
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Lars Bishop-2
Question: I(a*b) would work as long as both “a" and “b” are numeric. Is there a way I can force the behaviour of model.matrix when one of these variables is a factor (as in "f1:trt" from my example below)?

Specifically, based on my example below, I would like to always return the first matrix (where the first level of “f1” is omitted in the resulting matrix). This would’t happen if the user specifies the terms in reverse order (as per the second matrix).

Thanks again,
Lars.


> On Jan 17, 2016, at 2:53 PM, Lars Bishop <[hidden email]> wrote:
>
> This is very helpful, thanks!
>
> Lars.
>
>
>> On Jan 17, 2016, at 1:34 PM, Charles C. Berry <[hidden email]> wrote:
>>
>> On Sun, 17 Jan 2016, Lars Bishop wrote:
>>
>>> I’d appreciate your help on understanding the following.
>>
>>> It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
>>
>>
>> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or R-devel, but I cannot seem to lay my browser on them right now.
>>
>>
>>>
>>> set.seed(1)
>>> x1 <- rnorm(100)
>>> f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
>>> trt <- sample(c(-1,1), 100, replace = TRUE)
>>> df <- data.frame(x1=x1, f1=f1, trt=trt)
>>>
>>> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
>>> [1] 100 4
>>>
>>> dim(model.matrix(~ f1:trt + x1:trt, data = df))
>>> [1] 100 5
>>>
>>
>> By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.
>>
>> If you use the latter form, the issue you raise goes away.
>>
>> Note that `I(some.expr)' gives you the ability to force the behavior of model.matrix to be exactly what you want by suitably crafting `some.expr', heuristics notwithstanding.
>>
>> HTH,
>>
>> Chuck
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Peter Dalgaard-2
In reply to this post by Berry, Charles

> On 17 Jan 2016, at 19:34 , Charles C. Berry <[hidden email]> wrote:
>
>
> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or R-devel, but I cannot seem to lay my browser on them right now.
>

And IIRC: yup, and one of the issues is that
(a) some rules work left-to-right
(b) the logic is oblivious to the factor/vector distinction

For factors a,b,c, what happens for ~a:b + b:c is that a:b gets the full term expansion since the marginals a and b are not in the model but since b is part of the fully expanded a:b,  b:c gets the reduced form expansion as it would in ~b + b:c (the c-within-b thing). Swapping the terms gives you a different result, but at least it is the same model in the sense that the columns span the same subspace.

If a and b are vectors, and c is a factor, you get the same logic: expand a:b fully, then treat b:c as in b + b:c. Unfortunately, a:b is just the product of a and b, whether or not it is fully expanded, so it doesn't really make sense to proceed as if b is contained in a preceding term. So the net result is that you end up with one column less than you probably wanted.



--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Guelman, Leo
In reply to this post by Lars Bishop-2
Is it really the same model?

Following the example provided by Lars:

set.seed(1)
x1 <- rnorm(100)
f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
trt <- sample(c(-1,1), 100, replace = TRUE)
y <- factor(sample(c(0,1), 100, T))
df <- data.frame(y=y, x1=x1, f1=f1, trt=trt)

fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
coef(fit1)

fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
coef(fit2)

identical(fitted(fit1), fitted(fit2))
[1] FALSE



_______________________________________________________________________
If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of this confirmation for future reference.  

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Peter Dalgaard-2

On 18 Jan 2016, at 16:49 , Guelman, Leo <[hidden email]> wrote:

> Is it really the same model?


No, and I didn't say that they would be. I did say that they would be in the all-factor case, which does seem to be right:

> df$trt <- factor(df$trt)
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # still differs
> df$x1 <- factor(sample(c(-1,1), 100, replace = TRUE))
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # looks like it's on diagonal
> identical(fitted(fit1), fitted(fit2)) # wrong check
[1] FALSE
> all.equal(fitted(fit1), fitted(fit2)) # better
[1] TRUE


-pd


>  
> Following the example provided by Lars:
>  
> set.seed(1)
> x1 <- rnorm(100)
> f1 <- factor(sample(letters[1:3], 100, replace = TRUE))
> trt <- sample(c(-1,1), 100, replace = TRUE)
> y <- factor(sample(c(0,1), 100, T))
> df <- data.frame(y=y, x1=x1, f1=f1, trt=trt)
>  
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> coef(fit1)
>  
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> coef(fit2)
>  
> identical(fitted(fit1), fitted(fit2))
> [1] FALSE
>  
>  
>  
> _______________________________________________________________________
>
> If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
>

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Order of formula terms in model.matrix

Guelman, Leo
Thanks Peter. That make sense. Nevertheless, what comes at a surprise to me (and maybe to others) is that one can potentially get different fits by simply swapping the terms in the model formula.

Best,
Leo.

-----Original Message-----
From: peter dalgaard [mailto:[hidden email]]
Sent: 2016, January, 18 11:16 AM
To: Guelman, Leo
Cc: [hidden email]; Charles C. Berry
Subject: Re: [R] Order of formula terms in model.matrix


On 18 Jan 2016, at 16:49 , Guelman, Leo <[hidden email]> wrote:

> Is it really the same model?


No, and I didn't say that they would be. I did say that they would be in the all-factor case, which does seem to be right:

> df$trt <- factor(df$trt)
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # still differs
> df$x1 <- factor(sample(c(-1,1), 100, replace = TRUE))
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # looks like it's on diagonal
> identical(fitted(fit1), fitted(fit2)) # wrong check
[1] FALSE
> all.equal(fitted(fit1), fitted(fit2)) # better
[1] TRUE


-pd


>  
> Following the example provided by Lars:
>  
> set.seed(1)
> x1 <- rnorm(100)
> f1 <- factor(sample(letters[1:3], 100, replace = TRUE)) trt <-
> sample(c(-1,1), 100, replace = TRUE) y <- factor(sample(c(0,1), 100,
> T)) df <- data.frame(y=y, x1=x1, f1=f1, trt=trt)
>  
> fit1 <- glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> coef(fit1)
>  
> fit2 <- glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> coef(fit2)
>  
> identical(fitted(fit1), fitted(fit2))
> [1] FALSE
>  
>  
>  
> ______________________________________________________________________
> _
>
> If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
>

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

_______________________________________________________________________
If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the above-noted email address; please retain a copy of this confirmation for future reference.  

Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) ci-joint(s) par voie électronique à l'adresse courriel indiquée ci-dessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.