

I’d appreciate your help on understanding the following.
It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
set.seed(1)
x1 < rnorm(100)
f1 < factor(sample(letters[1:3], 100, replace = TRUE))
trt < sample(c(1,1), 100, replace = TRUE)
df < data.frame(x1=x1, f1=f1, trt=trt)
dim(model.matrix( ~ x1:trt + f1:trt, data = df))
[1] 100 4
dim(model.matrix(~ f1:trt + x1:trt, data = df))
[1] 100 5
Thanks,
Lars.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


On Sun, 17 Jan 2016, Lars Bishop wrote:
> I’d appreciate your help on understanding the following.
>
> It is not very clear to me from the model.matrix documentation, why
> simply changing the order of terms in the formula may change the number
> of resulting columns. Please note I’m purposely not including main
> effects in the model formula in this case.
IIRC, there are some heuristics involved harking back to the White Book. I
recall there have been discussions of whether and how this could be fixed
before on this list and or Rdevel, but I cannot seem to lay my browser on
them right now.
>
>
> set.seed(1)
> x1 < rnorm(100)
> f1 < factor(sample(letters[1:3], 100, replace = TRUE))
> trt < sample(c(1,1), 100, replace = TRUE)
> df < data.frame(x1=x1, f1=f1, trt=trt)
>
> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
> [1] 100 4
>
> dim(model.matrix(~ f1:trt + x1:trt, data = df))
> [1] 100 5
>
By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.
If you use the latter form, the issue you raise goes away.
Note that `I(some.expr)' gives you the ability to force the behavior of
model.matrix to be exactly what you want by suitably crafting `some.expr',
heuristics notwithstanding.
HTH,
Chuck
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


This is very helpful, thanks!
Lars.
> On Jan 17, 2016, at 1:34 PM, Charles C. Berry < [hidden email]> wrote:
>
> On Sun, 17 Jan 2016, Lars Bishop wrote:
>
>> I’d appreciate your help on understanding the following.
>
>> It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
>
>
> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or Rdevel, but I cannot seem to lay my browser on them right now.
>
>
>>
>> set.seed(1)
>> x1 < rnorm(100)
>> f1 < factor(sample(letters[1:3], 100, replace = TRUE))
>> trt < sample(c(1,1), 100, replace = TRUE)
>> df < data.frame(x1=x1, f1=f1, trt=trt)
>>
>> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
>> [1] 100 4
>>
>> dim(model.matrix(~ f1:trt + x1:trt, data = df))
>> [1] 100 5
>>
>
> By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.
>
> If you use the latter form, the issue you raise goes away.
>
> Note that `I(some.expr)' gives you the ability to force the behavior of model.matrix to be exactly what you want by suitably crafting `some.expr', heuristics notwithstanding.
>
> HTH,
>
> Chuck
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Question: I(a*b) would work as long as both “a" and “b” are numeric. Is there a way I can force the behaviour of model.matrix when one of these variables is a factor (as in "f1:trt" from my example below)?
Specifically, based on my example below, I would like to always return the first matrix (where the first level of “f1” is omitted in the resulting matrix). This would’t happen if the user specifies the terms in reverse order (as per the second matrix).
Thanks again,
Lars.
> On Jan 17, 2016, at 2:53 PM, Lars Bishop < [hidden email]> wrote:
>
> This is very helpful, thanks!
>
> Lars.
>
>
>> On Jan 17, 2016, at 1:34 PM, Charles C. Berry < [hidden email]> wrote:
>>
>> On Sun, 17 Jan 2016, Lars Bishop wrote:
>>
>>> I’d appreciate your help on understanding the following.
>>
>>> It is not very clear to me from the model.matrix documentation, why simply changing the order of terms in the formula may change the number of resulting columns. Please note I’m purposely not including main effects in the model formula in this case.
>>
>>
>> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or Rdevel, but I cannot seem to lay my browser on them right now.
>>
>>
>>>
>>> set.seed(1)
>>> x1 < rnorm(100)
>>> f1 < factor(sample(letters[1:3], 100, replace = TRUE))
>>> trt < sample(c(1,1), 100, replace = TRUE)
>>> df < data.frame(x1=x1, f1=f1, trt=trt)
>>>
>>> dim(model.matrix( ~ x1:trt + f1:trt, data = df))
>>> [1] 100 4
>>>
>>> dim(model.matrix(~ f1:trt + x1:trt, data = df))
>>> [1] 100 5
>>>
>>
>> By `x1:trt' I guess you mean the same thing as `I(x1*trt)'.
>>
>> If you use the latter form, the issue you raise goes away.
>>
>> Note that `I(some.expr)' gives you the ability to force the behavior of model.matrix to be exactly what you want by suitably crafting `some.expr', heuristics notwithstanding.
>>
>> HTH,
>>
>> Chuck
>> ______________________________________________
>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/rhelp>> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html>> and provide commented, minimal, selfcontained, reproducible code.
>
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


> On 17 Jan 2016, at 19:34 , Charles C. Berry < [hidden email]> wrote:
>
>
> IIRC, there are some heuristics involved harking back to the White Book. I recall there have been discussions of whether and how this could be fixed before on this list and or Rdevel, but I cannot seem to lay my browser on them right now.
>
And IIRC: yup, and one of the issues is that
(a) some rules work lefttoright
(b) the logic is oblivious to the factor/vector distinction
For factors a,b,c, what happens for ~a:b + b:c is that a:b gets the full term expansion since the marginals a and b are not in the model but since b is part of the fully expanded a:b, b:c gets the reduced form expansion as it would in ~b + b:c (the cwithinb thing). Swapping the terms gives you a different result, but at least it is the same model in the sense that the columns span the same subspace.
If a and b are vectors, and c is a factor, you get the same logic: expand a:b fully, then treat b:c as in b + b:c. Unfortunately, a:b is just the product of a and b, whether or not it is fully expanded, so it doesn't really make sense to proceed as if b is contained in a preceding term. So the net result is that you end up with one column less than you probably wanted.

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Is it really the same model?
Following the example provided by Lars:
set.seed(1)
x1 < rnorm(100)
f1 < factor(sample(letters[1:3], 100, replace = TRUE))
trt < sample(c(1,1), 100, replace = TRUE)
y < factor(sample(c(0,1), 100, T))
df < data.frame(y=y, x1=x1, f1=f1, trt=trt)
fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
coef(fit1)
fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
coef(fit2)
identical(fitted(fit1), fitted(fit2))
[1] FALSE
_______________________________________________________________________
If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the abovenoted email address; please retain a copy of this confirmation for future reference.
Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) cijoint(s) par voie électronique à l'adresse courriel indiquée cidessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


On 18 Jan 2016, at 16:49 , Guelman, Leo < [hidden email]> wrote:
> Is it really the same model?
No, and I didn't say that they would be. I did say that they would be in the allfactor case, which does seem to be right:
> df$trt < factor(df$trt)
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # still differs
> df$x1 < factor(sample(c(1,1), 100, replace = TRUE))
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # looks like it's on diagonal
> identical(fitted(fit1), fitted(fit2)) # wrong check
[1] FALSE
> all.equal(fitted(fit1), fitted(fit2)) # better
[1] TRUE
pd
>
> Following the example provided by Lars:
>
> set.seed(1)
> x1 < rnorm(100)
> f1 < factor(sample(letters[1:3], 100, replace = TRUE))
> trt < sample(c(1,1), 100, replace = TRUE)
> y < factor(sample(c(0,1), 100, T))
> df < data.frame(y=y, x1=x1, f1=f1, trt=trt)
>
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> coef(fit1)
>
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> coef(fit2)
>
> identical(fitted(fit1), fitted(fit2))
> [1] FALSE
>
>
>
> _______________________________________________________________________
>
> If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the abovenoted email address; please retain a copy of this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) cijoint(s) par voie électronique à l'adresse courriel indiquée cidessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
>

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Thanks Peter. That make sense. Nevertheless, what comes at a surprise to me (and maybe to others) is that one can potentially get different fits by simply swapping the terms in the model formula.
Best,
Leo.
Original Message
From: peter dalgaard [mailto: [hidden email]]
Sent: 2016, January, 18 11:16 AM
To: Guelman, Leo
Cc: [hidden email]; Charles C. Berry
Subject: Re: [R] Order of formula terms in model.matrix
On 18 Jan 2016, at 16:49 , Guelman, Leo < [hidden email]> wrote:
> Is it really the same model?
No, and I didn't say that they would be. I did say that they would be in the allfactor case, which does seem to be right:
> df$trt < factor(df$trt)
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # still differs
> df$x1 < factor(sample(c(1,1), 100, replace = TRUE))
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> plot(fitted(fit1), fitted(fit2)) # looks like it's on diagonal
> identical(fitted(fit1), fitted(fit2)) # wrong check
[1] FALSE
> all.equal(fitted(fit1), fitted(fit2)) # better
[1] TRUE
pd
>
> Following the example provided by Lars:
>
> set.seed(1)
> x1 < rnorm(100)
> f1 < factor(sample(letters[1:3], 100, replace = TRUE)) trt <
> sample(c(1,1), 100, replace = TRUE) y < factor(sample(c(0,1), 100,
> T)) df < data.frame(y=y, x1=x1, f1=f1, trt=trt)
>
> fit1 < glm(y ~ x1:trt + f1:trt, data = df, family = binomial)
> coef(fit1)
>
> fit2 < glm(y ~ f1:trt + x1:trt, data = df, family = binomial)
> coef(fit2)
>
> identical(fitted(fit1), fitted(fit2))
> [1] FALSE
>
>
>
> ______________________________________________________________________
> _
>
> If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the abovenoted email address; please retain a copy of this confirmation for future reference.
>
> Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) cijoint(s) par voie électronique à l'adresse courriel indiquée cidessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
>

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]
_______________________________________________________________________
If you received this email in error, please advise the sender (by return email or otherwise) immediately. You have consented to receive the attached electronically at the abovenoted email address; please retain a copy of this confirmation for future reference.
Si vous recevez ce courriel par erreur, veuillez en aviser l'expéditeur immédiatement, par retour de courriel ou par un autre moyen. Vous avez accepté de recevoir le(s) document(s) cijoint(s) par voie électronique à l'adresse courriel indiquée cidessus; veuillez conserver une copie de cette confirmation pour les fins de reference future.
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.

