stats::lm has inconsistent output when adding constant to dependent variable

David J. Birke
Dear R community,

I just stumbled upon the following behavior in R version 3.6.0:

set.seed(42)
y <- rep(0, 30)
x <- rbinom(30, 1, prob = 0.91)
# The following will not show any t-statistic or p-value
summary(lm(y~x))
#  The following will show t-statistic and p-value
summary(lm(1+y~x))

My expected output is that the first case should report t-statistic and
p-value. My intuition might be tricking me, but I think that a constant
shift of the data should be fully absorbed by the constant and not
affect inference about the slope.

Is this a bug or is there a reason why there should be a discrepancy
between the two outputs?

Best,
David

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: stats::lm has inconsistent output when adding constant to dependent variable

mark leeds
Hi: In your example, you made the response zero in every case, which
is going to cause problems. In GLMs, I think they call it the Donsker
effect. I'm not sure what it's called in OLS; probably a lack of
identifiability. Note that you probably shouldn't be using zeros and
ones as the response in a regression anyway.

If you change the response to the following, you get what you'd expect.

y <- c(rep(0, 15), rep(1,15))
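
To check the intuition from the original post on a non-degenerate response,
here is a minimal sketch. It assumes a fixed, hypothetical x (three zeros
followed by 27 ones) instead of the random draw used in the thread, so the
result does not depend on the random number generator. When the residuals
are not all zero, shifting the response by a constant should change only
the intercept:

```r
# Hypothetical predictor (an assumption, not the thread's rbinom() draw).
x <- rep(c(0, 1), times = c(3, 27))
# Response suggested above: half zeros, half ones.
y <- c(rep(0, 15), rep(1, 15))

fit0 <- lm(y ~ x)
fit1 <- lm(1 + y ~ x)   # same model with the response shifted by 1

# The slope row (estimate, std. error, t value, p-value) should agree
# up to floating-point tolerance.
slope0 <- coef(summary(fit0))["x", ]
slope1 <- coef(summary(fit1))["x", ]
isTRUE(all.equal(slope0, slope1))

# Only the intercept estimate moves, and it moves by the shift.
shift <- unname(coef(fit1)["(Intercept)"] - coef(fit0)["(Intercept)"])
isTRUE(all.equal(shift, 1))
```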

On Fri, Sep 27, 2019 at 1:48 PM David J. Birke <[hidden email]> wrote:

> Dear R community,
>
> I just stumbled upon the following behavior in R version 3.6.0:
>
> set.seed(42)
> y <- rep(0, 30)
> x <- rbinom(30, 1, prob = 0.91)
> # The following will not show any t-statistic or p-value
> summary(lm(y~x))
> #  The following will show t-statistic and p-value
> summary(lm(1+y~x))
>
> My expected output is that the first case should report t-statistic and
> p-value. My intuition might be tricking me, but I think that a constant
> shift of the data should be fully absorbed by the constant and not
> affect inference about the slope.
>
> Is this a bug or is there a reason why there should be a discrepancy
> between the two outputs?
>
> Best,
> David

Re: stats::lm has inconsistent output when adding constant to dependent variable

mark leeds
Correction to my previous answer: I looked around, and I don't think it's
called the Donsker effect. It seems to be referred to simply as a case of
"perfect separability". If you google "perfect separation in GLMs", you'll
get a lot of information.

On Fri, Sep 27, 2019 at 2:35 PM Mark Leeds <[hidden email]> wrote:

> Hi: In your example, you made the response zero in every case, which
> is going to cause problems. In GLMs, I think they call it the Donsker
> effect. I'm not sure what it's called in OLS; probably a lack of
> identifiability. Note that you probably shouldn't be using zeros and
> ones as the response in a regression anyway.
>
> If you change the response to the following, you get what you'd expect.
>
> y <- c(rep(0, 15), rep(1,15))

Re: stats::lm has inconsistent output when adding constant to dependent variable

Rui Barradas
In reply to this post by David J. Birke
Hello,

Maybe FAQ 7.31?

Check the residuals, they are all "zero" in both cases:

fit0 <- lm(y~x)
fit1 <- lm(1+y~x)

# residuals
table(resid(fit0))
#
# 0
#30

table(resid(fit1))
#
#-5.21223595241838e-16 -4.93038065763132e-31  3.12734157145103e-15
#                    6                    23                     1


Hope this helps,

Rui Barradas
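
FAQ 7.31 concerns why R does not consider some numbers equal that look
equal. A minimal sketch of the idea, independent of lm(): exact comparison
with == is fragile for doubles, while all.equal() compares up to a
tolerance. The threshold sqrt(.Machine$double.eps) used at the end is only
one common choice, not something the thread prescribes.

```r
# sqrt(2) cannot be stored exactly as a double, so squaring it does not
# give back exactly 2.
a <- sqrt(2)
a * a == 2                    # FALSE on IEEE-754 doubles
a * a - 2                     # a tiny nonzero difference

# all.equal() compares up to a small tolerance instead.
isTRUE(all.equal(a * a, 2))   # TRUE

# A "zero" residual such as -5.2e-16 is zero for practical purposes,
# even though it is not identical to 0.
r <- -5.21223595241838e-16    # value copied from the table above
r == 0                        # FALSE
abs(r) < sqrt(.Machine$double.eps)  # TRUE
```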


Re: stats::lm has inconsistent output when adding constant to dependent variable

mark leeds
In reply to this post by mark leeds
Hi Berwin: Yes, that's it. Donsker is famous for a functional CLT, so I was
mixing up statistics and stochastic processes. I'd better stick to
statistics; it's safer! Thanks for the correction. I'm cc'ing R-help since
it may be useful to someone there. See below for Berwin's comment.


Mark

On Sat, Sep 28, 2019 at 3:36 AM Berwin A Turlach <[hidden email]>
wrote:

> G'day Mark,
>
> On Fri, 27 Sep 2019 14:43:28 -0400
> Mark Leeds <[hidden email]> wrote:
>
> > correction to my previous answer. I looked around and I don't think
> > it's called the donsker effect.
>
> I think you meant the Hauck-Donner effect [1], which refers to the
> problem of separation for binomial GLMs (not all GLMs).
>
> Cheers,
>
>         Berwin
>
> [1] Hauck, Jr., W.W. and Donner, A. (1977) Wald's test as applied to
> hypotheses in logit analysis.  Journal of the American Statistical
> Association 72, 851-853.
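
For readers unfamiliar with separation, here is a minimal sketch on a
made-up six-point data set (an assumption for illustration, not data from
the thread). The response is perfectly separated by x, so the
maximum-likelihood slope diverges. glm() stops at a large estimate with an
even larger standard error, and the Wald test looks unremarkable even
though the deviance drops to essentially zero.

```r
# Toy data: y is 0 whenever x <= 3 and 1 whenever x >= 4.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# glm() warns about fitted probabilities of 0 or 1; silence that here.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Wald table: large slope estimate, very large standard error,
# and a p-value close to 1.
coef(summary(fit))

# The drop in deviance tells a different story: the model fits
# the data almost perfectly.
fit$null.deviance - fit$deviance
```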

Re: stats::lm has inconsistent output when adding constant to dependent variable

Ales Ziberna
In reply to this post by Rui Barradas
In one case the residuals are exactly zero, and in the other they are almost
zero. This is the reason for the different results.

Of course, they should be exactly the same; the difference is due to
rounding error, since not every value that arises in the computation can be
represented exactly in binary floating point.
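
A minimal sketch of why the two summaries differ, assuming a fixed,
hypothetical x rather than the thread's random draw. The t-statistic is the
estimate divided by its standard error, and the standard error is
proportional to the residual standard error. When that is exactly zero the
ratio is 0/0; when it is merely tiny, the ratio is a finite number driven by
rounding noise.

```r
x <- rep(c(0, 1), times = c(3, 27))  # hypothetical predictor (assumption)
y <- rep(0, 30)                      # constant response, as in the thread

fit0 <- lm(y ~ x)
fit1 <- lm(1 + y ~ x)

# summary() may warn about an essentially perfect fit; silence that here.
s0 <- suppressWarnings(summary(fit0))
s1 <- suppressWarnings(summary(fit1))

s0$sigma   # exactly 0: every residual is exactly zero
s1$sigma   # zero or tiny, depending on rounding in the least-squares solve

# With sigma exactly 0 the standard errors are 0 too, so the t column
# comes from a division by zero and carries no information.
coef(s0)
# With a tiny nonzero sigma the t column can be finite, but it reflects
# rounding noise, not evidence about the slope.
coef(s1)
```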

Best,
Aleš Žiberna

On Fri, Sep 27, 2019 at 9:01 PM Rui Barradas <[hidden email]> wrote:

> Hello,
>
> Maybe FAQ 7.31?
>
> Check the residuals, they are all "zero" in both cases:
>
> fit0 <- lm(y~x)
> fit1 <- lm(1+y~x)
>
> # residuals
> table(resid(fit0))
> #
> # 0
> #30
>
> table(resid(fit1))
> #
> #-5.21223595241838e-16 -4.93038065763132e-31  3.12734157145103e-15
> #                    6                    23                     1
>
>
> Hope this helps,
>
> Rui Barradas