lm() takes weights from formula environment


lm() takes weights from formula environment

John Mount
I know that programmers can reason this out from R's lazy argument evaluation rules plus the explicit match.call()/eval() that lm() uses to work with the passed-in formula and data frame. But from a statistical user's point of view this seems counter-productive. At best it works as if the user were passing in the name of the weights variable instead of its values (I know this is the obvious consequence of NSE).

lm() takes instance weights from the formula environment. Usually that environment is the interactive environment or a close child of it, and we are lucky enough to have no intervening name collisions, so we don't have a problem. However, it makes programming over formulas for lm() a bit tricky. Here is an example of the issue.

Is there any recommended discussion on this and how to work around it? In my own work I explicitly set the formula environment and put the weights in that environment.


d <- data.frame(x = 1:3, y = c(3, 3, 4))
w <- c(1, 5, 1)

# works
lm(y ~ x, data = d, weights = w)  

# fails, as weights are taken from the formula environment
fn <- function() {  # deliberately set up formula with bad value in environment
  w <- c(-1, -1, -1, -1)  # bad weights
  f <- as.formula(y ~ x)  # y ~ x is created here, so its environment is this function's frame, which holds the bad w
  return(f)
}
lm(fn(), data = d, weights = w)
# Error in model.frame.default(formula = fn(), data = d, weights = w, drop.unused.levels = TRUE) :
#   variable lengths differ (found for '(weights)')
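
A short sketch of why the second call fails, assuming the same logic as fn() above: the formula that comes back carries the function's local frame as its environment, and that frame holds the four-element w. The simplified fn body and the name f_env are illustrative, not part of the original example.

```r
# Simplified version of fn() from the example above (illustrative)
fn <- function() {
  w <- c(-1, -1, -1, -1)  # local value that shadows any global w
  y ~ x                   # the formula records this function's frame
}

f_env <- environment(fn())
print(ls(f_env))                # names that lm() can reach through the formula
print(get("w", envir = f_env))  # the four-element vector, not a global w
```

Since d has three rows and this w has four elements, model.frame() reports the length mismatch for '(weights)'.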

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: lm() takes weights from formula environment

Duncan Murdoch-2
This is fairly clearly documented in ?lm:

"All of weights, subset and offset are evaluated in the same way as
variables in formula, that is first in data and then in the environment
of formula."

There are lots of possible places to look for weights, but this seems to
me like a pretty sensible search order.  In most cases the environment
of the formula will have a parent environment chain that eventually
leads to the global environment, so (with no conflicts) your strategy of
defining w there will sometimes work, but looks pretty unreliable.

When you say you want to work around this search order, I think the
obvious way is to add your w vector to your d dataframe.  That way it is
guaranteed to be found even if there's a conflicting variable in the
formula environment, or the global environment.

Duncan Murdoch
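
A minimal sketch of the data-frame approach described above, reusing d, w and fn() from the original post. The names fit_col and fit_ref are made up for the illustration.

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))
d$w <- c(1, 5, 1)  # weights stored alongside the data

fn <- function() {
  w <- c(-1, -1, -1, -1)  # conflicting value in the formula environment
  y ~ x
}

# `w` is looked up in `d` first, so the conflicting local `w` is never reached
fit_col <- lm(fn(), data = d, weights = w)

# reference fit with the weights given literally
fit_ref <- lm(y ~ x, data = d, weights = c(1, 5, 1))
print(all.equal(coef(fit_col), coef(fit_ref)))
```

Because the formula names its predictors explicitly, the extra w column is not used as a predictor.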

On 09/08/2020 2:13 p.m., John Mount wrote:

> I know that programmers can reason this out from R's lazy argument evaluation rules plus the explicit match.call()/eval() that lm() uses to work with the passed-in formula and data frame. But from a statistical user's point of view this seems counter-productive. At best it works as if the user were passing in the name of the weights variable instead of its values (I know this is the obvious consequence of NSE).
>
> lm() takes instance weights from the formula environment. Usually that environment is the interactive environment or a close child of the interactive environment and we are lucky enough to have no intervening name collisions so we don't have a problem. However it makes programming over formulas for lm() a bit tricky. Here is an example of the issue.
>
> Is there any recommended discussion on this and how to work around it? In my own work I explicitly set the formula environment and put the weights in that environment.
>
>
> d <- data.frame(x = 1:3, y = c(3, 3, 4))
> w <- c(1, 5, 1)
>
> # works
> lm(y ~ x, data = d, weights = w)
>
> # fails, as weights are taken from the formula environment
> fn <- function() {  # deliberately set up formula with bad value in environment
>    w <- c(-1, -1, -1, -1)  # bad weights
>    f <- as.formula(y ~ x)  # y ~ x is created here, so its environment is this function's frame, which holds the bad w
>    return(f)
> }
> lm(fn(), data = d, weights = w)
> # Error in model.frame.default(formula = fn(), data = d, weights = w, drop.unused.levels = TRUE) :
> #   variable lengths differ (found for '(weights)')
>

Re: lm() takes weights from formula environment

John Mount
Doesn't this preclude "y ~ ." style notations?

> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch <[hidden email]> wrote:
>
> This is fairly clearly documented in ?lm:
>
> "All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula."
>
> There are lots of possible places to look for weights, but this seems to me like a pretty sensible search order.  In most cases the environment of the formula will have a parent environment chain that eventually leads to the global environment, so (with no conflicts) your strategy of defining w there will sometimes work, but looks pretty unreliable.
>
> When you say you want to work around this search order, I think the obvious way is to add your w vector to your d dataframe.  That way it is guaranteed to be found even if there's a conflicting variable in the formula environment, or the global environment.
>
> Duncan Murdoch

Re: lm() takes weights from formula environment

Duncan Murdoch-2
On 09/08/2020 3:01 p.m., John Mount wrote:
> Doesn't this preclude "y ~ ." style notations?

Yes, but you can use "y ~ . - w".

Duncan Murdoch
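
A small sketch of the difference, assuming the weights live in a column named w of the data frame. The data values and the names x1, x2, fit_all and fit_drop are invented for the illustration.

```r
d <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5),
                y  = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.2),
                w  = c(1, 5, 1, 2, 1, 3))

# "." expands to every column except the response, so w becomes a predictor
fit_all <- lm(y ~ ., data = d, weights = w)

# removing w from the expansion keeps it as a weight only
fit_drop <- lm(y ~ . - w, data = d, weights = w)

print(names(coef(fit_all)))
print(names(coef(fit_drop)))
```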



Re: lm() takes weights from formula environment

Duncan Murdoch-2
On 09/08/2020 3:07 p.m., Duncan Murdoch wrote:
> On 09/08/2020 3:01 p.m., John Mount wrote:
>> Doesn't this preclude "y ~ ." style notations?
>
> Yes, but you can use "y ~ . - w".

And as was pointed out to me offline, often one doesn't have a simple
vector w giving the weights, instead one computes the weights from the
predictors.  So if weights = f(pred), the original "y ~ ." would be fine.

Duncan Murdoch
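
A sketch of that case, with weights computed from a predictor rather than stored as a column. The weighting rule 1 / x and the data values are assumptions chosen only for the illustration.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(1.2, 1.9, 3.1, 3.9, 5.2))

# the weights expression is evaluated in d, so x refers to the column;
# no extra column is added, so "." still expands to x alone
fit <- lm(y ~ ., data = d, weights = 1 / x)

print(names(coef(fit)))
print(weights(fit))
```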


Re: lm() takes weights from formula environment

John Mount
In reply to this post by Duncan Murdoch-2
I wish I had started with "I am disappointed that lm() doesn't continue its search for weights into the calling environment" or "the fact that lm() looks only in the formula environment and data frame for weights doesn't seem consistent with how other values are treated."

But I did not. So I do apologize for both that and for negative tone on my part.


Simplified example:

d <- data.frame(x = 1:3, y = c(1, 2, 1))
w <- c(1, 10, 1)
f <- as.formula(y ~ x)
lm(f, data = d, weights = w)  # works

# fails
environment(f) <- baseenv()
lm(f, data = d, weights = w)
# Error in eval(extras, data, env) : object 'w' not found



Re: lm() takes weights from formula environment

R devel mailing list
I assume you are concerned about this because the formula is defined
in one environment and the model fitting with weights occurs in a
separate function.  If that is the case then the model fitting
function can create a new environment, a child of the formula's
environment, add the weights variable to it, and make that the new
environment of the formula.  (This new environment is only an
attribute of the copy of the formula in the model fitting function: it
will not affect the formula outside of that function.)  E.g.,


d <- data.frame(x = 1:3, y = c(1, 2, 1))

lmWithWeightsBad <- function(formula, data, weights) {
    lm(formula, data=data, weights=weights)
}
coef(lmWithWeightsBad(y~x, data=d, weights=c(2,5,1))) # lm finds the 'weights' function in package:stats
#Error in model.frame.default(formula = formula, data = data, weights = weights,  :
#  invalid type (closure) for variable '(weights)'

lmWithWeightsGood <- function(formula, data, weights) {
    envir <- new.env(parent = environment(formula))
    envir$weights <- weights
    environment(formula) <- envir
    lm(formula, data=data, weights=weights)
}
coef(lmWithWeightsGood(y~x, data=d, weights=c(2,5,1)))
#(Intercept)           x
#  1.2173913   0.2173913

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Aug 10, 2020 at 10:43 AM John Mount <[hidden email]> wrote:

>
> I wish I had started with "I am disappointed that lm() doesn't continue its search for weights into the calling environment" or "the fact that lm() looks only in the formula environment and data frame for weights doesn't seem consistent with how other values are treated."
>
> But I did not. So I do apologize for both that and for negative tone on my part.
>
>
> Simplified example:
>
> d <- data.frame(x = 1:3, y = c(1, 2, 1))
> w <- c(1, 10, 1)
> f <- as.formula(y ~ x)
> lm(f, data = d, weights = w)  # works
>
> # fails
> environment(f) <- baseenv()
> lm(f, data = d, weights = w)
> # Error in eval(extras, data, env) : object 'w' not found
>

Re: lm() takes weights from formula environment

Duncan Murdoch-2
In reply to this post by John Mount
On 10/08/2020 1:42 p.m., John Mount wrote:
> I wish I had started with "I am disappointed that lm() doesn't continue its search for weights into the calling environment" or "the fact that lm() looks only in the formula environment and data frame for weights doesn't seem consistent with how other values are treated."

Normally searching is done automatically by following a chain of
environments.  It's easy to add something to the head of the chain (e.g.
data), it's hard to add something in the middle or at the end (because
the chain ends with emptyenv(), which is not allowed to have a parent).

So I'd suggest using

  environment(f) <- environment()

before calling lm() if you want the calling environment to be in the
search.  Setting it to baseenv() doesn't really make sense, unless you
want to disable all searches except in data, in which case emptyenv()
would make more sense (but I haven't tried it, so it might break something).

Duncan Murdoch
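
A sketch of this suggestion inside a wrapper function, so that the wrapper's own frame becomes the formula environment. The function name fit_with_weights and its argument names are hypothetical.

```r
# Hypothetical wrapper: makes its own frame the formula environment
fit_with_weights <- function(f, data, w) {
  environment(f) <- environment()  # lm() will now find this function's `w`
  lm(f, data = data, weights = w)
}

d <- data.frame(x = 1:3, y = c(1, 2, 1))
f <- y ~ x
environment(f) <- baseenv()  # the failing setup from the simplified example

fit <- fit_with_weights(f, d, c(1, 10, 1))
fit_ref <- lm(y ~ x, data = d, weights = c(1, 10, 1))
print(all.equal(coef(fit), coef(fit_ref)))
```

Only the wrapper's copy of the formula is changed; f in the caller keeps its environment. The fitted model does hold a reference to the wrapper's frame.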


Re: lm() takes weights from formula environment

John Mount
Thank you for your suggestion. I do know how to work around the issue.  I usually build a fresh environment as a child of the base environment and then insert the weights there. I was just trying to provide an example of the issue.

emptyenv() cannot be used, because the eval needs an environment in which it can find functions (it errors out with "could not find function list" even if weights are not used).

For some applications one doesn't want the formula to have a non-trivial environment with respect to serialization.  Nina Zumel wrote about reference leaks in lm()/glm() and a good part of that was environments other than global/base (such as those formed when building a formula in a function) capturing references to unrelated structures.
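
A minimal sketch of the workaround described above: a fresh environment whose parent is the base environment, holding only the weights. The function name fit_clean and its argument names are hypothetical.

```r
# Hypothetical helper: the formula sees only `w` plus the base environment
fit_clean <- function(f, data, w) {
  env <- new.env(parent = baseenv())
  assign("w", w, envir = env)  # the only object the formula environment holds
  environment(f) <- env
  lm(f, data = data, weights = w)
}

d <- data.frame(x = 1:3, y = c(1, 2, 1))
fit <- fit_clean(y ~ x, d, c(1, 10, 1))
fit_ref <- lm(y ~ x, data = d, weights = c(1, 10, 1))
print(all.equal(coef(fit), coef(fit_ref)))

# the environment stored with the model's terms holds nothing but `w`
print(ls(attr(terms(fit), ".Environment")))
```

One caveat of this sketch: a formula that calls functions from outside base (for example poly() from stats) would likely need them namespaced as stats::poly(), since the lookup chain no longer passes through the attached packages.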



> On Aug 10, 2020, at 11:34 AM, Duncan Murdoch <[hidden email]> wrote:
>
> On 10/08/2020 1:42 p.m., John Mount wrote:
>> I wish I had started with "I am disappointed that lm() doesn't continue its search for weights into the calling environment" or "the fact that lm() looks only in the formula environment and data frame for weights doesn't seem consistent with how other values are treated."
>
> Normally searching is done automatically by following a chain of environments.  It's easy to add something to the head of the chain (e.g. data), it's hard to add something in the middle or at the end (because the chain ends with emptyenv(), which is not allowed to have a parent).
>
> So I'd suggest using
>
> environment(f) <- environment()
>
> before calling lm() if you want the calling environment to be in the search.  Setting it to baseenv() doesn't really make sense, unless you want to disable all searches except in data, in which case emptyenv() would make more sense (but I haven't tried it, so it might break something).
>
> Duncan Murdoch

Re: lm() takes weights from formula environment

John Mount
Forgot the URL: https://win-vector.com/2014/05/30/trimming-the-fat-from-glm-models-in-r/

> On Aug 10, 2020, at 11:50 AM, John Mount <[hidden email]> wrote:
>
> Thank you for your suggestion. I do know how to work around the issue.  I usually build a fresh environment as a child of the base environment and then insert the weights there. I was just trying to provide an example of the issue.
>
> emptyenv() cannot be used, because the eval needs an environment in which it can find functions (it errors out with "could not find function list" even if weights are not used).
>
> For some applications one doesn't want the formula to have a non-trivial environment with respect to serialization.  Nina Zumel wrote about reference leaks in lm()/glm() and a good part of that was environments other than global/base (such as those formed when building a formula in a function) capturing references to unrelated structures.


