Interpreting model matrix columns when using contr.sum


Gang Chen-4
With the following example using contr.sum for both factors,

> dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
> model.matrix(~ a * b, dd, contrasts = list(a="contr.sum", b="contr.sum"))

   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
1            1  1  0  1  0  0     1     0     0     0     0     0
2            1  1  0  0  1  0     0     0     1     0     0     0
3            1  1  0  0  0  1     0     0     0     0     1     0
4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
5            1  0  1  1  0  0     0     1     0     0     0     0
6            1  0  1  0  1  0     0     0     0     1     0     0
7            1  0  1  0  0  1     0     0     0     0     0     1
8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
...

I have two questions:

(1) I assume the 1st column (under intercept) is the overall mean, the
2nd column (under a1) is the difference between the 1st level of
factor a and the overall mean, the 4th column (under b1) is the
difference between the 1st level of factor b and the overall mean. Is
this interpretation correct?

(2) I'm not so sure about those interaction columns. For example, what
is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
versus the overall mean, or something more complicated?

Thanks in advance for your help,
Gang

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Interpreting model matrix columns when using contr.sum

Douglas Bates-2
On Fri, Jan 23, 2009 at 4:58 PM, Gang Chen <[hidden email]> wrote:

> With the following example using contr.sum for both factors,
>
>> dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
>> model.matrix(~ a * b, dd, contrasts = list(a="contr.sum", b="contr.sum"))
>
>   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
> 1            1  1  0  1  0  0     1     0     0     0     0     0
> 2            1  1  0  0  1  0     0     0     1     0     0     0
> 3            1  1  0  0  0  1     0     0     0     0     1     0
> 4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
> 5            1  0  1  1  0  0     0     1     0     0     0     0
> 6            1  0  1  0  1  0     0     0     0     1     0     0
> 7            1  0  1  0  0  1     0     0     0     0     0     1
> 8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
> 9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
> 10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
> 11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
> 12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
> ...

> I have two questions:

> (1) I assume the 1st column (under intercept) is the overall mean, the
> 2nd column (under a1) is the difference between the 1st level of
> factor a and the overall mean, the 4th column (under b1) is the
> difference between the 1st level of factor b and the overall mean.

> Is this interpretation correct?

I don't think so and furthermore I don't see why the contrasts should
have an interpretation.  The contrasts are simply a parameterization
of the space spanned by the indicator columns of the levels of the
factors.  Interpretations as overall means, etc. are mostly a holdover
from antiquated concepts of how analysis of variance tables should be
evaluated.

If you want to determine the interpretation of particular coefficients
for the special case of a balanced design (which doesn't always mean a
resulting balanced data set - I remind my students that expecting a
balanced design to produce balanced data is contrary to Murphy's Law),
the easiest way of doing so (I think this is right, but I can
somehow manage to confuse myself on this with great ease) is to invert
the coding matrix; each row of the inverse gives the weights on the
level means that define the corresponding coefficient:

> contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
> solve(cbind(1, contr.sum(3)))
              1          2          3
[1,]  0.3333333  0.3333333  0.3333333
[2,]  0.6666667 -0.3333333 -0.3333333
[3,] -0.3333333  0.6666667 -0.3333333
> solve(cbind(1, contr.sum(4)))
         1     2     3     4
[1,]  0.25  0.25  0.25  0.25
[2,]  0.75 -0.25 -0.25 -0.25
[3,] -0.25  0.75 -0.25 -0.25
[4,] -0.25 -0.25  0.75 -0.25

That is, the first coefficient is the "overall mean" (but only for a
balanced data set), the second is a contrast of the first level with
the others, the third is a contrast of the second level with the
others and so on.
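
A minimal sketch of why inverting the coding matrix gives the interpretation, assuming one mean per level: the level means equal the coding matrix times the coefficients, so the coefficients are the inverse times the level means. The values in `level_means` are hypothetical and chosen only for illustration; the `lm()` fit on one observation per level is there as a cross-check.

```r
# Hypothetical level means for a 3-level factor (illustrative values only)
level_means <- c(10, 14, 21)

# Coding matrix: intercept column plus sum-to-zero contrasts
C <- cbind(1, contr.sum(3))

# level_means = C %*% beta, hence beta = solve(C) %*% level_means
beta <- drop(solve(C) %*% level_means)

# Cross-check: a saturated lm() fit with one observation per level
f <- gl(3, 1)
fit <- lm(level_means ~ f, contrasts = list(f = "contr.sum"))

# Both give: mean of the level means, then each level mean minus that mean
print(rbind(from_inverse = beta, from_lm = unname(coef(fit))))
```

Here the first coefficient is the mean of the three level means and the next two are the first and second level means minus that mean, which is what the rows of the inverse encode.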

> (2) I'm not so sure about those interaction columns. For example, what
> is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
> versus the overall mean, or something more complicated?

Well, at the risk of sounding trivial, a1:b1 is the product of the a1
and b1 columns.  You need a basis for a certain subspace and this
provides one.  I don't see why there must be interpretations of the
coefficients.
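
The statement about products can be checked directly on the model matrix from the original post. This is only a sketch: the helper names `int_cols` and `is_product` are introduced here for illustration, and the argument is spelled out in full as `contrasts.arg` rather than relying on partial matching of `contrasts`.

```r
# Same balanced 3 x 4 layout as in the original post
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))
X <- model.matrix(~ a * b, dd,
                  contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# Names of the interaction columns, e.g. "a1:b1"
int_cols <- grep(":", colnames(X), value = TRUE)

# For each interaction column, compare it with the elementwise product
# of the two main-effect columns named on either side of the colon
is_product <- sapply(int_cols, function(nm) {
  parts <- strsplit(nm, ":", fixed = TRUE)[[1]]
  all(X[, nm] == X[, parts[1]] * X[, parts[2]])
})
print(is_product)

# The 12 columns have full rank, so they span the same space as the
# 12 cell indicator columns
print(qr(X)$rank == ncol(X))
```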

Re: Interpreting model matrix columns when using contr.sum

John Fox
Dear Doug and Gang Chen,

With balanced data and sum-to-zero contrasts, the intercept is indeed the
general mean of the response; the coefficient of a1 is the mean of the
response in category a1 minus the general mean; the coefficient of a1:b1 is
the mean of the response in cell (a1, b1) minus the sum of the general mean
and the coefficients of a1 and b1; etc. For unbalanced data (and balanced data) the
intercept is the mean of the cell means; the coefficient of a1 is the mean
of cell means at level a1 minus the intercept; etc. Whether all this is of
interest is another question, since a simple graph of cell means tells a
more digestible story about the data.
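
These interpretations can be checked numerically. The following is a sketch on simulated data: the response `y`, the seed, and the helper `check_interpretation` are assumptions made for illustration and are not part of the thread. It fits the saturated two-way model with sum-to-zero contrasts and compares four coefficients with the corresponding quantities built from the cell means, first for balanced data and then after dropping two rows.

```r
set.seed(1)  # arbitrary seed; the response values are simulated

# Balanced layout: 3 x 4 cells, two replicates per cell
dd <- expand.grid(a = gl(3, 1), b = gl(4, 1), rep = 1:2)
dd$y <- rnorm(nrow(dd), mean = 10)

# Compare sum-to-zero coefficients with quantities built from cell means
check_interpretation <- function(data) {
  fit <- lm(y ~ a * b, data = data,
            contrasts = list(a = "contr.sum", b = "contr.sum"))
  cell  <- tapply(data$y, list(data$a, data$b), mean)  # 3 x 4 cell means
  grand <- mean(cell)                  # mean of the cell means
  a1    <- mean(cell[1, ]) - grand     # level a1 minus intercept
  b1    <- mean(cell[, 1]) - grand     # level b1 minus intercept
  a1b1  <- cell[1, 1] - grand - a1 - b1
  rbind(from_lm = coef(fit)[c("(Intercept)", "a1", "b1", "a1:b1")],
        from_cell_means = c(grand, a1, b1, a1b1))
}

balanced   <- check_interpretation(dd)
# Drop two rows; every cell still has at least one observation
unbalanced <- check_interpretation(dd[-c(1, 2), ])

print(balanced)
print(unbalanced)
```

In both cases the two rows of each result should agree. With balanced data the intercept also equals `mean(dd$y)`; in the unbalanced case it is the unweighted mean of the cell means, which generally differs from the raw mean of the response.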

Regards,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


Re: Interpreting model matrix columns when using contr.sum

Gang Chen-4
Many thanks to both Drs. Bates and Fox for the help!

I also figured out yesterday what Dr. Fox just said regarding the
interpretations of those coefficients for a balanced design. Thanks
Dr. Bates for the suggestion of using solve(cbind(1, contr.sum(4))) to
sort out the factor level effects. Model validation is very important,
but interpreting those coefficients, at least in the case of balanced
designs, also provides some insights about various effects for the
people working in the field.

Gang

