Interesting behavior of lm() with small, problematic data sets

Interesting behavior of lm() with small, problematic data sets

Glover, Tim-2
I've recently come across the following results from the lm() function when it is applied to a particular type of admittedly difficult data. When working with small data sets (for instance, 3 points) that have the same response for different values of the predictor variable, the resulting slope estimate is a reasonable approximation of the expected 0.0, but the p-value of that slope estimate is a surprising value. A reproducible example is included below, along with the output of summary().

######### example code
x <- c(1,2,3)
y <- c(1,1,1)

# the above gives the data set {(1,1), (2,1), (3,1)} to regress

new.rez <- lm(y ~ x)  # regress constant y on changing x
summary(new.rez)      # display results of regression

######## end of example code

Results:

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3
 5.906e-17 -1.181e-16  5.906e-17

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept)  1.000e+00  2.210e-16  4.525e+15   <2e-16 ***
x           -1.772e-16  1.023e-16 -1.732e+00    0.333
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.447e-16 on 1 degrees of freedom
Multiple R-squared:  0.7794,    Adjusted R-squared:  0.5589
F-statistic: 3.534 on 1 and 1 DF,  p-value: 0.3112

Warning message:
In summary.lm(new.rez) : essentially perfect fit: summary may be unreliable


##############

There is a warning that the summary may be unreliable due to the essentially perfect fit, but a p-value of 0.3112 doesn’t seem reasonable.
As a side note, the various r^2 values seem odd too.
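
A quick consistency check on the reported numbers may help here. This is only a sketch using base R's pt() and pf(); the variable names are illustrative, and the inputs are the rounded values copied from the summary output above.

```r
# Values copied (rounded) from the summary output above.
t_slope <- -1.732   # t value for x
f_stat  <- 3.534    # F-statistic on 1 and 1 DF
n       <- 3        # number of observations
df_res  <- n - 2    # residual degrees of freedom

# Two-sided p-value of the slope from a t distribution with 1 df.
p_slope <- 2 * pt(-abs(t_slope), df = df_res)

# Overall p-value from the F distribution with 1 and 1 df.
p_f <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

# R-squared and adjusted R-squared implied by the F-statistic.
r2     <- f_stat / (f_stat + df_res)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

print(round(c(p_slope = p_slope, p_f = p_f, r2 = r2, adj_r2 = adj_r2), 4))

# In exact arithmetic a one-predictor model has F equal to t^2.
print(c(t_squared = t_slope^2, f_stat = f_stat))
```

The p-values and R-squared values follow from the reported t and F statistics, so the arithmetic inside the summary is self-consistent. What does not hold is F = t^2 (about 3.00 against 3.534), which is one sign that the statistics themselves are built from rounding error.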







Tim Glover
Senior Scientist II (Geochemistry, Statistics), Americas - Environment & Infrastructure, Amec Foster Wheeler

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Interesting behavior of lm() with small, problematic data sets

Jeff Newmiller
Why does an unreliable fit have to provide "reasonable" results?

More specifically, p-values arise from observed distributions... if your slopes are "in the noise" then the slope estimate's location within that distribution could be anywhere relative to the center and spread of that very narrow distribution, leading to, ah, what was it... oh, right... "unreliable" results.
--
Sent from my phone. Please excuse my brevity.


Re: Interesting behavior of lm() with small, problematic data sets

David Winsemius

> On Sep 5, 2017, at 6:24 AM, Glover, Tim <[hidden email]> wrote:
> There is a warning that the summary may be unreliable due to the essentially perfect fit, but a p-value of 0.3112 doesn’t seem reasonable.
> As a side note, the various r^2 values seem odd too.

You have an overfitted model with only 3 perfectly fit-able data points and you are whinging about a Wald statistic about which you were warned. I think you are wasting our time. (But I'm fully retired and I have a lot of time to waste.)

I seem to remember that a t-distribution with 1 degree of freedom is actually the Cauchy distribution. I would point out that you can also get:

> 2*pt(-1.732e+00, 1)
[1] 0.3333414

So from that perspective any value might be "reasonable": with three data points and two fitted coefficients you have one residual degree of freedom, and the t-statistic is essentially the ratio 0/0 from a numerical point of view.
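
The Cauchy connection can be checked directly. The following is a small sketch, assuming only base R's pt() and pcauchy(); the variable names are illustrative.

```r
q <- -1.732  # t value reported for the slope

# CDF of the t distribution with 1 df and of the standard Cauchy.
p_t      <- pt(q, df = 1)
p_cauchy <- pcauchy(q)

# Closed form of the standard Cauchy CDF: 1/2 + atan(q)/pi.
p_closed <- 0.5 + atan(q) / pi

print(c(t = p_t, cauchy = p_cauchy, closed = p_closed))
print(2 * p_t)  # two-sided p-value for the slope
```

All three values should agree to within floating-point tolerance. Since atan(sqrt(3)) is pi/3, a t value of exactly -sqrt(3) gives a two-sided p-value of exactly 1/3, which is where the reported 0.333 comes from.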

--
David.


Re: Interesting behavior of lm() with small, problematic data sets

m.pi.R
Tim,

I think what you're seeing is
https://en.wikipedia.org/wiki/Loss_of_significance.

Cheers,

Mark
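
For readers unfamiliar with the effect behind that link, here is a minimal illustration in base R. It is a sketch of floating-point cancellation in general, not of what lm() does internally, and the variable names are illustrative.

```r
# Machine epsilon: the gap between 1 and the next larger double.
eps <- .Machine$double.eps
print(eps)

# A term far below eps is absorbed when added to 1, so the
# subtraction returns 0 instead of 1e-17.
tiny <- 1e-17
absorbed <- (1 + tiny) - 1
print(absorbed)

# Reordering the same arithmetic keeps the small term.
kept <- (1 - 1) + tiny
print(kept)

# A term just above eps survives, but with few correct digits:
# the relative error of the recovered value is large.
delta <- 1e-15
recovered <- (1 + delta) - 1
rel_err <- abs(recovered - delta) / delta
print(rel_err)
```

Quantities of order 1e-16 derived from data of order 1, such as the residuals and slope in the original output, sit right at this boundary.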




Re: Interesting behavior of lm() with small, problematic data sets

S Ellison-2
> I think what you're seeing is
> https://en.wikipedia.org/wiki/Loss_of_significance.

Almost.
All the results in the OP's summary are reflections of finite precision in the analytically exact solution, leading to residuals smaller than the double precision limit. The summary is correctly warning that it's all potentially nonsense, and indeed the only things you can trust are the coefficient values (to within .Machine$double.eps or thereabouts).

Interestingly, though, my current version of R (3.4.0) gives numerically exact coefficients (c(1, 0)) and identically zero standard errors.

So this particular example is apparently version-specific.
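
One way to see this on any given installation is to compare the fit against .Machine$double.eps. This is a sketch only; the exact residuals depend on the R version and linear algebra library, so no particular output is claimed, and the variable names are illustrative.

```r
x <- c(1, 2, 3)
y <- c(1, 1, 1)
fit <- lm(y ~ x)

eps <- .Machine$double.eps

# Coefficients agree with the exact solution c(1, 0) up to tolerance.
coef_ok <- isTRUE(all.equal(unname(coef(fit)), c(1, 0)))

# Largest residual, expressed in units of machine epsilon.
resid_in_eps <- max(abs(residuals(fit))) / eps

print(coef_ok)
print(resid_in_eps)
```

A value of resid_in_eps near or below 1 means the residuals carry no information beyond rounding error, so anything computed from them (standard errors, t values, p-values) is equally uninformative.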

S Ellison



Re: Interesting behavior of lm() with small, problematic data sets

JRG
Indeed (version-specific).

With R 3.4.1 on Linux, I get coefficients and residuals that are
numerically exact, F-statistic = NaN, p-value = NA, R-squared = NaN, etc.

All of which is what ought to happen, given that the response variable
(y) is not actually variable.
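
The NaN values follow from 0/0 once the fit is exact. The sketch below writes the sums of squares out by hand; the formulas are the usual textbook ones and are assumed to roughly mirror what the summary computes, and the variable names are illustrative.

```r
y <- c(1, 1, 1)
fitted_exact <- c(1, 1, 1)  # exact fit: intercept 1, slope 0

# Model and residual sums of squares are both exactly zero.
mss <- sum((fitted_exact - mean(fitted_exact))^2)
rss <- sum((y - fitted_exact)^2)

# Zero divided by zero is NaN in IEEE arithmetic.
r2     <- mss / (mss + rss)
f_stat <- (mss / 1) / (rss / 1)   # 1 model df, 1 residual df

print(c(mss = mss, rss = rss, r2 = r2, f_stat = f_stat))
```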


---JRG
John R. Gleason


On 09/06/2017 09:10 AM, S Ellison wrote:

> So this particular example is apparently version-specific.

Re: Interesting behavior of lm() with small, problematic data sets

Rainer Krug-2
Same version on Mac, same results.


> On 6 Sep 2017, at 15:22, JRG <[hidden email]> wrote:
>
> Indeed (version-specific).
>
> With R 3.4.1 on linux, I get coefficients and residuals that are
> numerically exact, F-statistic = NaN, p-value = NA, R-squared = NaN, etc.