inconsistent handling of factor, character, and logical predictors in lm()


inconsistent handling of factor, character, and logical predictors in lm()

Fox, John
Dear R-devel list members,

I've discovered an inconsistency in how lm() and similar functions handle logical predictors as opposed to factor or character predictors. An "lm" object for a model with factor or character predictors records the levels of each factor, or the unique values of each character predictor, in the $xlevels component of the object, but not the FALSE/TRUE values of a logical predictor, even though the latter is treated as a factor in the fit.

For example:

------------ snip --------------

> m1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
> m1$xlevels
$Species
[1] "setosa"     "versicolor" "virginica"
 
> m2 <- lm(Sepal.Length ~ Sepal.Width + as.character(Species), data=iris)
> m2$xlevels
$`as.character(Species)`
[1] "setosa"     "versicolor" "virginica"

> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3$xlevels
named list()

> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
    data = iris)

Coefficients:
               (Intercept)                 Sepal.Width  I(Species == "setosa")TRUE  
                    3.5571                      0.9418                     -1.7797  

------------ snip --------------

I believe that the culprit is .getXlevels(), which makes provision for factor and character predictors but not for logical predictors:

------------ snip --------------

> .getXlevels
function (Terms, m)
{
    xvars <- vapply(attr(Terms, "variables"), deparse2,
        "")[-1L]
    if ((yvar <- attr(Terms, "response")) > 0)
        xvars <- xvars[-yvar]
    if (length(xvars)) {
        xlev <- lapply(m[xvars], function(x) if (is.factor(x))
            levels(x)
        else if (is.character(x))
            levels(as.factor(x)))
        xlev[!vapply(xlev, is.null, NA)]
    }
}

------------ snip --------------

It would be simple to modify the last test in .getXlevels to

        else if (is.character(x) || is.logical(x))

which would cause .getXlevels() to include c("FALSE", "TRUE") for a logical predictor (assuming both values are present in the data). I'd find that sufficient, but alternatively there could be a separate test for logical predictors that returns c(FALSE, TRUE).
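For concreteness, here is a sketch of the modified function. The name .getXlevels2 is mine, and deparse2 is inlined as a simplified stand-in because the real one is internal to the stats namespace; this is a sketch of the proposed change, not the actual R sources:

```r
## Sketch of the proposed change to .getXlevels(); the "|| is.logical(x)"
## test is the only substantive difference from the current version.
.getXlevels2 <- function(Terms, m)
{
    deparse2 <- function(x)        # simplified stand-in for stats' internal deparse2
        paste(deparse(x, width.cutoff = 500L), collapse = " ")
    xvars <- vapply(attr(Terms, "variables"), deparse2, "")[-1L]
    if ((yvar <- attr(Terms, "response")) > 0)
        xvars <- xvars[-yvar]
    if (length(xvars)) {
        xlev <- lapply(m[xvars], function(x) if (is.factor(x))
            levels(x)
        else if (is.character(x) || is.logical(x))  # proposed change
            levels(as.factor(x)))
        xlev[!vapply(xlev, is.null, NA)]
    }
}
```

Applied to the model frame of m3 above, this would presumably yield an $xlevels entry of c("FALSE", "TRUE") for I(Species == "setosa").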

I discovered this issue when a function in the effects package failed for a model with a logical predictor. Although it's possible to program around the problem, I think that it would be better to handle factors, character predictors, and logical predictors consistently.

Best,
 John

--------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: socialsciences.mcmaster.ca/jfox/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: inconsistent handling of factor, character, and logical predictors in lm()

Abby Spurdle
> I think that it would be better to handle factors, character predictors, and logical predictors consistently.

"logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
And the model matrix should be the same, either way.

I think the first question to be asked is, which is the best approach,
categorical or continuous?
The continuous approach seems simpler and more efficient to me, but
output from the categorical approach may be more intuitive, for some
people.

I note that using factors and characters doesn't necessarily produce
consistent output for $xlevels, because factors can have their levels
re-ordered.
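For instance (using iris; m_a and m_b are throwaway names):

```r
## Releveling a factor reorders the levels recorded in $xlevels,
## so two otherwise-equivalent fits can disagree on level order.
m_a <- lm(Sepal.Length ~ Species, data = iris)
m_b <- lm(Sepal.Length ~ relevel(Species, ref = "virginica"), data = iris)
m_a$xlevels[[1]]  # "setosa" "versicolor" "virginica"
m_b$xlevels[[1]]  # "virginica" "setosa" "versicolor"
```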


Re: inconsistent handling of factor, character, and logical predictors in lm()

Fox, John
Dear Abby,

> On Aug 30, 2019, at 8:20 PM, Abby Spurdle <[hidden email]> wrote:
>
>> I think that it would be better to handle factors, character predictors, and logical predictors consistently.
>
> "logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
> And the model matrix should be the same, either way.

I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables.

That one would get the same coefficient in either case is a consequence of the coincidence and the fact that the default contrasts for unordered factors are contr.treatment(). For example, if you changed the contrasts option, you'd get a different estimate (though of course a model with the same fit to the data and an equivalent interpretation):

------------ snip --------------

> options(contrasts=c("contr.sum", "contr.poly"))
> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
    data = iris)

Coefficients:
            (Intercept)              Sepal.Width  I(Species == "setosa")1  
                 2.6672                   0.9418                   0.8898  

> head(model.matrix(m3))
  (Intercept) Sepal.Width I(Species == "setosa")1
1           1         3.5                      -1
2           1         3.0                      -1
3           1         3.2                      -1
4           1         3.1                      -1
5           1         3.6                      -1
6           1         3.9                      -1
> tail(model.matrix(m3))
    (Intercept) Sepal.Width I(Species == "setosa")1
145           1         3.3                       1
146           1         3.0                       1
147           1         2.5                       1
148           1         3.0                       1
149           1         3.4                       1
150           1         3.0                       1

> lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data=iris)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
    "setosa"), data = iris)

Coefficients:
                    (Intercept)                      Sepal.Width  as.numeric(Species == "setosa")  
                         3.5571                           0.9418                          -1.7797  

> -2*coef(m3)[3]
I(Species == "setosa")1
              -1.779657

------------ snip --------------


>
> I think the first question to be asked is, which is the best approach,
> categorical or continuous?
> The continuous approach seems simpler and more efficient to me, but
> output from the categorical approach may be more intuitive, for some
> people.

I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors. One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.

>
> I note that using factors and characters doesn't necessarily
> produce consistent output for $xlevels, because factors can have
> their levels re-ordered.

Again, this misses the point: Both factors and character predictors produce elements in $xlevels; logical predictors do not, even though they are treated in the model as factors. That factors have levels that aren't necessarily ordered alphabetically is a reason that I prefer using factors to using character predictors, but this has nothing to do with the point I was trying to make about $xlevels.

Best,
 John

  -------------------------------------------------
  John Fox, Professor Emeritus
  McMaster University
  Hamilton, Ontario, Canada
  Web: http://socserv.mcmaster.ca/jfox


Re: inconsistent handling of factor, character, and logical predictors in lm()

R devel mailing list
> Functions like lm() treat logical predictors as factors, *not* as
> numerical variables.

Not quite.  A factor with all elements the same causes lm() to give an
error while a logical of all TRUEs or all FALSEs just omits it from the
model (it gets a coefficient of NA).  This is a fairly common situation
when you fit models to subsets of a big data.frame.  This is an argument
for fixing the single-valued-factor problem, which would become more
noticeable if logicals were treated as factors.

> d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750),
+                 Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))
> coef(lm(data=d, Weight ~ Age + Diseased))
 (Intercept)          Age DiseasedTRUE
    877.7333       5.4000    -151.3333
> coef(lm(data=d, Weight ~ Age + factor(Diseased)))
         (Intercept)                  Age factor(Diseased)TRUE
            877.7333               5.4000            -151.3333
> coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))
 (Intercept)          Age DiseasedTRUE
    847.3333      13.0000           NA
> coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)),
+         subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Sat, Aug 31, 2019 at 8:54 AM Fox, John <[hidden email]> wrote:



Re: inconsistent handling of factor, character, and logical predictors in lm()

Fox, John
Dear Bill,

Thanks for pointing this difference out -- I was unaware of it.

I think that the difference occurs in model.matrix.default(), which coerces character variables but not logical variables to factors. Later it treats both factors and logical variables as "factors" in that it applies contrasts to both, but unused factor levels are dropped while an unused logical level is not.

I don't see why logical variables shouldn't be treated just as character variables currently are, both with respect to single-valued predictors (whether that is treated as an error or as collinear with the intercept, thus getting an NA coefficient) and with respect to $xlevels.
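A minimal sketch restating the asymmetry (the data frame d2 is hypothetical, chosen so the logical and character columns carry the same single-valued information):

```r
## A single-valued logical is kept in the model (NA coefficient), while
## the same information as a character errors out when it is coerced to
## a one-level factor.
d2 <- data.frame(y = c(1, 2, 4), x = c(1, 2, 3),
                 b = c(TRUE, TRUE, TRUE), g = c("a", "a", "a"))
coef(lm(y ~ x + b, data = d2))   # bTRUE is NA (collinear with intercept)
try(lm(y ~ x + g, data = d2))    # Error: contrasts need 2 or more levels
```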

Best,
 John

> On Aug 31, 2019, at 1:21 PM, William Dunlap via R-devel <[hidden email]> wrote:


Re: inconsistent handling of factor, character, and logical predictors in lm()

Abby Spurdle
> I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors.

I'm unenthusiastic about mapping TRUE to -1 and FALSE to 1 in the model matrix.
(I nearly got that back to front.)

However, I've decided to agree with your original suggestion regarding
$xlevels: I think it should include the logical levels, if that's the
right term...

I note, though, that under the second alternative the output still
wouldn't be completely consistent, because one case would lead to a
logical vector while the other cases lead to character vectors.
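That is:

```r
## Under the first suggestion, logical levels arrive as characters,
## consistent with the factor and character cases:
levels(as.factor(c(TRUE, FALSE)))   # "FALSE" "TRUE"  (character vector)
## whereas a separate logical branch returning c(FALSE, TRUE) would make
## the logical entry the only non-character element of $xlevels.
```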
