Some questions about R's modelling algebra

Hadley Wickham-2
Hi all,

In preparation for teaching a class next week, I've been reviewing R's
standard modelling algebra. I've used it for a long time and have a
pretty good intuitive feel for how it works, but would like to
understand more of the technical details. The best (online) reference
I've found so far is the section in "An Introduction to R"
(http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models).
Does anyone have any other suggestions?

I have a few questions about the definitions given in "An Introduction to R":

 * "M_1 : M_2 - The tensor product of M_1 and M_2. If both terms are
factors, then the “subclasses” factor."

   From my reading, the usual interpretation of a tensor product when
x and y are vectors is the outer product.  I don't see how that would
work here - how does a matrix work as a predictor in a linear model?
In what sense is the tensor product of x with itself equal to x?

  What is the subclasses factor? Is it interaction(M_1, M_2, sep = "")?

 * "M_1 %in% M_2 - Similar to M_1:M_2, but with a different coding."

  How is the coding different?

  Where is %in% documented within R?  I'm pretty sure it's a different
action to ?"%in%", and it's not mentioned in ?formula

I have also read G. N. Wilkinson and C. E. Rogers. Symbolic
descriptions of factorial models for analysis of variance. Journal of
the Royal Statistical Society. Series C (Applied Statistics),
22:392–399, 1973. - Can anyone comment on any important differences to
R's modelling algebra? What does %in% correspond to in Wilkinson and
Rogers' framework?

Thanks!

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Some questions about R's modelling algebra

slre


>>> Hadley Wickham <[hidden email]> 02/07/2010 14:59:53 >>>
> Where is %in% documented within R?  I'm pretty sure it's a different
>action to ?"%in%", and it's not mentioned in ?formula

?formula in R 2.9.2 says in para 2:
"The %in% operator indicates that the terms on its left are nested
within those on the right. For example a + b %in% a expands to the
formula a + a:b. "



Re: Some questions about R's modelling algebra

Hadley Wickham-2
> ?formula in R 2.9.2 says in para 2:
> "The %in% operator indicates that the terms on its left are nested
> within those on the right. For example a + b %in% a expands to the
> formula a + a:b. "

Ooops, missed that.  So b %in% a = a:b, and that's what's meant by
"different coding".

Hadley

Re: Some questions about R's modelling algebra

Richard M. Heiberger
In reply to this post by Hadley Wickham-2
Hadley,

The S modeling language was designed with Wilkinson and
Rogers in mind.  The notation was changed from their paper to
retain consistency with the parsing rules for ordinary algebra in
S.  I think of ":" as an indicator of an indexing system into the
dummy variables.  It is not an indicator of degrees of freedom.

For simplicity in notation, let A be a factor with a levels and B
be a factor with b levels.  Then A:B implies a set of dummy
variables with at most ab columns indexed by an A level and a B
level.  The degrees of freedom associated with A:B depend on the
linear dependencies of the associated dummy variables with the
dummy variables of other terms in the model.  The excess columns
can be suppressed when the dummy variables are generated or they
can be pivoted out during the analysis.  When we have the special
case A:A, there is only one factor mentioned, so the indexing
scheme is based on just the one factor.  You could generate the
full set of a^2 columns, and then you would discover that they
are all linearly dependent on the first a.
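
The column counts described above can be checked with a minimal sketch. The data frame `d`, its level names, and the copy `A2` are hypothetical and exist only for illustration; the layout is assumed to be fully crossed and balanced.

```r
# Hypothetical balanced layout: A has a = 2 levels, B has b = 3 levels.
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")),
                 rep = 1:2)

# A:B on its own (no intercept, no main effects): one indicator column
# per (A level, B level) cell, i.e. a * b columns.
m_cells <- model.matrix(~ 0 + A:B, data = d)
ncol(m_cells)

# With main effects in the model, the A:B term keeps only the columns
# that are not linearly dependent on the earlier terms.
m_cross <- model.matrix(~ A + B + A:B, data = d)
table(attr(m_cross, "assign"))   # 0 = intercept, 1 = A, 2 = B, 3 = A:B

# Both matrices have rank a * b.
qr(m_cells)$rank
qr(m_cross)$rank

# The A:A case, forced to generate all a^2 columns by giving the copy of A
# a different name (A2 is a made-up name): the rank is still only a.
m_aa <- model.matrix(~ 0 + A:A2, data = transform(d, A2 = A))
c(columns = ncol(m_aa), rank = qr(m_aa)$rank)
```

Under these assumptions the interaction term contributes (a - 1)(b - 1) columns once A and B are present, which is the sense in which ":" indexes dummy variables rather than fixing degrees of freedom.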


The columns can be labeled either
a1b1 a1b2 a1b3 a2b1 a2b2 a2b3
or
a1b1 a2b1 a1b2 a2b2 a1b3 a2b3

If there is crossing, we would report a single sum of squares
and degrees of freedom for the interaction.  If there is nesting,
say a/b , then it might make sense to group the dummy variables
say (a1b1 a1b2 a1b3) and (a2b1 a2b2 a2b3) and report simple
effects sum of squares and degrees of freedom for each of the
groups.
The structure of the individual columns depends on the set of
contrasts used for the A and B factors.
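
As a rough illustration of that last point, the sketch below builds the same crossed model under two contrast codings. The data frame `d` is hypothetical; `contr.treatment` and `contr.sum` are the standard contrast functions in the stats package.

```r
# Hypothetical balanced layout with two replicates per cell.
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")),
                 rep = 1:2)

# Same formula, two different contrast codings for A and B.
m_trt <- model.matrix(~ A * B, data = d,
                      contrasts.arg = list(A = "contr.treatment",
                                           B = "contr.treatment"))
m_sum <- model.matrix(~ A * B, data = d,
                      contrasts.arg = list(A = "contr.sum",
                                           B = "contr.sum"))

# The individual columns differ between the two codings ...
same_columns <- isTRUE(all.equal(unname(m_trt), unname(m_sum),
                                 check.attributes = FALSE))
same_columns

# ... but binding the two matrices together adds no new directions,
# so both codings span the same model space.
c(treatment = qr(m_trt)$rank,
  sum       = qr(m_sum)$rank,
  combined  = qr(cbind(m_trt, m_sum))$rank)
```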

Rich

Re: Some questions about R's modelling algebra

Thomas Lumley
In reply to this post by Hadley Wickham-2
On Fri, 2 Jul 2010, Hadley Wickham wrote:

> Hi all,
>
> In preparation for teaching a class next week, I've been reviewing R's
> standard modelling algebra. I've used it for a long time and have a
> pretty good intuitive feel for how it works, but would like to
> understand more of the technical details. The best (online) reference
> I've found so far is the section in "An Introduction to R"
> (http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models).
> Does anyone have any other suggestions?
>
> I have a few questions about the definitions given in "An Introduction to R":
>
> * "M_1 : M_2 - The tensor product of M_1 and M_2. If both terms are
> factors, then the “subclasses” factor."
>
>   From my reading, the usual interpretation of a tensor product when
> x and y are vectors is the outer product.  I don't see how that would
> work here - how does a matrix work as a predictor in a linear model?

Think of it for a single observation.  x and y specify terms that could
be scalars or could be row vectors (e.g. ns(x), poly(y,3)), and the terms
in x:y are the products of each term from x with each term from y.  It is
like taking the Kronecker product and then reshaping it back into a row
vector.
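
A small sketch of that row-wise view, assuming a made-up data frame `d` with two numeric columns and using poly() for both multi-column terms (the seed and sample size are arbitrary):

```r
set.seed(1)                      # arbitrary seed for a reproducible toy data set
d <- data.frame(x = rnorm(10), y = rnorm(10))

px <- poly(d$x, 2)               # 10 x 2 matrix term
py <- poly(d$y, 3)               # 10 x 3 matrix term

# Design matrix columns that R generates for the interaction of the two terms.
m <- model.matrix(~ 0 + poly(x, 2):poly(y, 3), data = d)

# Row-by-row Kronecker product, flattened back into a row vector.
# kronecker(py[i, ], px[i, ]) lets the columns of px vary fastest,
# which is the column order model.matrix() is assumed to use here.
k <- t(sapply(seq_len(nrow(d)),
              function(i) as.vector(kronecker(py[i, ], px[i, ]))))

dim(m)                           # one column per (px column, py column) pair
all.equal(unname(m), k, check.attributes = FALSE)
```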


> In what sense is the tensor product of x with itself equal to x?

This is the messy bit.  The 'product' operator is not the arithmetic
product, because x:x is not the same as x:z even if z=x.

The product of a set of single-column terms is formed by eliminating any
terms from the set that are syntactic duplicates and then taking the
arithmetic product of the remaining terms.  This is the Right Thing for
producing design matrices, but is a bit of a mess to describe.

So x:z:log(z) contains no duplicates and produces x*z*log(z).  x:z
contains no duplicates and produces x*z (even if z=x), but x:z:x
produces x*z and x:x produces x.
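
The duplicate-elimination rule can be seen directly from terms() and model.matrix(). This is only a sketch; the data frame `d` and its values are invented for illustration.

```r
d <- data.frame(x = c(1, 2, 3, 4))
d$z <- d$x                       # z is numerically identical to x

# Syntactic duplicates are dropped when the formula is expanded.
attr(terms(~ x:x),   "term.labels")   # "x"
attr(terms(~ x:z:x), "term.labels")   # "x:z"

# x:z is kept as an arithmetic product even though z equals x numerically,
# whereas x:x collapses to x itself.
m_xz <- model.matrix(~ 0 + x:z, data = d)
m_xx <- model.matrix(~ 0 + x:x, data = d)
cbind(x_z = m_xz[, 1], x_x = m_xx[, 1], x_squared = d$x^2)
```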


>  What is the subclasses factor? Is it interaction(M_1, M_2, sep = "")?

Yes.
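
One way to convince yourself of this, sketched with a hypothetical crossed data frame `d`: the indicator columns generated for A:B should match those generated for the combined factor, assuming both use the default ordering in which the first factor varies fastest.

```r
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")))
d$AB <- interaction(d$A, d$B, sep = "")   # the "subclasses" factor

levels(d$AB)

# Indicator columns for A:B versus indicator columns for the combined factor.
m_colon <- model.matrix(~ 0 + A:B, data = d)
m_inter <- model.matrix(~ 0 + AB,  data = d)
all(unname(m_colon) == unname(m_inter))
```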


You might find the Wilkinson & Rogers paper more helpful:

@Article{Wilkinson.Rogers.73,
   author       = "G. N. Wilkinson and C. E. Rogers",
   title        = "Symbolic description of factorial models for analysis of
                   variance",
   journal      = "Applied Statistics",
   volume       = "22",
   pages        = "392--399",
   year         = "1973",
   comment      = "Reference from MASS",
}

The notation is slightly different; R uses ':' for their '.' and '^' for their '**'.  I think the algebra is the same.

     -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

Re: Some questions about R's modelling algebra

Adrian Waddell
In reply to this post by Hadley Wickham-2
> Hadley Wickham <hadley <at> rice.edu>
>
>   Where is %in% documented within R?  I'm pretty sure it's a different
> action to ?"%in%", and it's not mentioned in ?formula

You can find the documentation for operators like <-, %in%, if, etc. by
putting the operators between quotes:

?"%in%"
?"<-"
?"if"

Regards,

Adrian

Re: Some questions about R's modelling algebra

Kingsford Jones
In reply to this post by Hadley Wickham-2
On Fri, Jul 2, 2010 at 8:16 AM, Hadley Wickham <[hidden email]> wrote:
>> ?formula in R 2.9.2 says in para 2:
>> "The %in% operator indicates that the terms on its left are nested
>> within those on the right. For example a + b %in% a expands to the
>> formula a + a:b. "
>
> Ooops, missed that.  So b %in% a = a:b, and that's what's meant by
> "different coding".

Or would this be true only if "b %in% a" was preceded by "a"?

attr(terms(y ~ B %in% A), 'term.labels')
#[1] "B:A"
attr(terms(y ~ B + B %in% A), 'term.labels')
#[1] "B"   "B:A"
attr(terms(y ~ A + B %in% A), 'term.labels')
#[1] "A"   "A:B"

suggesting a documentation buglet in Sec 11.1 of An Introduction to R,
where it states:

\begin{quote}
y ~ A*B
y ~ A + B + A:B
y ~ B %in% A
y ~ A/B
    Two factor non-additive model of y on A and B. The first two
specify the same crossed classification and the second two specify the
same nested classification. In abstract terms all four specify the
same model subspace.
\end{quote}

I think "y ~ B %in% A"  should be changed to "y ~ A + B %in% A" since

attr(terms(y ~ A/B), 'term.labels')
#[1] "A"   "A:B"

Or am I missing something?
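
For what it's worth, the "same model subspace" part of the quoted statement can be checked numerically, independently of how the term labels print. This is only a sketch with a made-up balanced data frame `d`; it says nothing about the labels themselves, which is the point raised above.

```r
# Hypothetical balanced layout with two replicates per cell.
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")),
                 rep = 1:2)

# The four formulae from the quoted passage, as one-sided formulae.
forms <- list(cross1 = ~ A * B,
              cross2 = ~ A + B + A:B,
              nest1  = ~ B %in% A,
              nest2  = ~ A / B)
mats <- lapply(forms, model.matrix, data = d)

# Rank of each design matrix, and rank of all four bound together.
# Equal ranks mean no formula contributes a direction the others lack.
ranks <- sapply(mats, function(m) qr(m)$rank)
ranks
combined_rank <- qr(do.call(cbind, mats))$rank
combined_rank
```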

Kingsford

