There are a couple of issues with [.terms that have bitten my survival code. At the
useR conference I promised you a detailed (readable) explanation, and have been lax in
getting it to you. The error was first pointed out in a bugzilla note from 2016, by the
way. The current survival code works around these.
Consider the following formula:
library(survival) # only to get access to the lung data set
test <- Surv(time, status) ~ age + offset(ph.ecog) + strata(inst)
tform <- terms(test, specials="strata")
mf <- model.frame(tform, data=lung)
mterm <- terms(mf)
The strata term is handled in a special way by coxph, and then needs to be removed from
the model formula before calling model.matrix.
To do this the code uses essentially the following, which fails for the formula above.
The root problem is the need for multiple subscripts.
\item The formula itself has length 5, with `~' as the first element
\item The variables and predvars attributes are call objects, each a list() with 4
elments: the response and all 3 predictors
\item The term.labels attribute omits the resonse and the offset, so has length 2
\item The factors attribute has 4 rows and 2 columns
\item The dataClasses attribute is a character vector of length 4
So the ideal result of mterm[remove the specials] would use subscript of
\item [-5] on the formula itself, variables and predvars attributes
\item [-2] for term.labels
\item [-4 , -2, drop=FALSE] for factor attribute
\item [-2] for order attribute
\item [-4] for the dataClasses attribute
That will recreate the formula that ``would have been'' had there been no strata term.
Now look at the first portion of the code in models.R
`[.terms` <- function (termobj, i)
resp <- if (attr(termobj, "response")) termobj[[2L]]
newformula <- attr(termobj, "term.labels")[i]
if (length(newformula) == 0L) newformula <- "1"
newformula <- reformulate(newformula, resp, attr(termobj, "intercept"),
result <- terms(newformula, specials = names(attr(termobj, "specials")))
# Edit the optional attributes
The use of reformulate() is a nice trick. However, the index reported in the specials
attribute is generated with reference to the variables
attribute, or equivalently the row names of the factors attribute, not with respect to the
term.labels attribute. For consistency the second line should instead be
newformula <- row.names(attr(termobj, "factors"))[i]
Of course, this will break code for anyone else who has used [.terms and, like me, has
been adjusting for the ``response is counted in specials but
not in term.labels'' feature. R core will have to discuss/decide what is the right thing
to do, and I'll adapt.
The reformulate trick breaks in another way, one that only appeared on my radar this week
via a formula like the following.
Surv(time, status) ~ age + (sex=='male') + strata(inst)
In both the term.labels attribute and the row/col names of the factors attribute the
parentheses disappear, and the result of the reformulate call is not a proper formula.
The + binds tighter than == leading to an error message that will confuse most users. We
can argue, and I probably would, that the user should have used I(sex=='male'). But they
didn't, and without the I() it is a legal formula, or at least one that currently works.
Fixing this issue is a lot harder.
An offset term causes issues in the 'Edit the optional attributes' part of the routine as
well. If you and/or R core will tell me what you think
the code should do, I'll create a patch. My vote would be to use rownames per the above
and ignore the () edge case.
The same basic code appears in drop.terms, by the way.