Correct statistical inference for linear regression models without intercept in R

Correct statistical inference for linear regression models without intercept in R

StatWM
Dear R community,

Is there a way to get correct t-values, p-values, and R squared for linear regression models specified without an intercept?

Example model:
summary(lm(y ~ 0 + x))

This gives p-values that are too low and an R squared that is too high. Is there a way to correct this, or should I specify the model with an intercept to get the correct values?

Thank you in advance!

Wojtek Musial

Re: Correct statistical inference for linear regression models without intercept in R

Arun.stat
What do x and y represent? Are they non-stationary or trending? If so, you would get a very high R2 (~97-99%) and a very low p-value; you may have landed in the world of spurious regression.

In that case, forcing the intercept to zero would not help. Work with the differenced series instead of the raw data.
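A minimal sketch of what I mean, assuming x and y are plain numeric vectors ordered in time:

# Sketch: regress first differences instead of levels to remove a common trend
dy <- diff(y)
dx <- diff(x)
summary(lm(dy ~ dx))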

Thanks and regards,

Re: Correct statistical inference for linear regression models without intercept in R

StatWM
Let's assume x and y are stationary, so it is not a spurious regression problem here. I think lm() has to include an intercept to give correct t-values, p-values, and R squared. I wonder whether the values can be corrected in R, though?

Re: Correct statistical inference for linear regression models without intercept in R

djmuseR
In reply to this post by StatWM
Hi:

On Tue, Jul 20, 2010 at 2:41 AM, StatWM <[hidden email]> wrote:

>
> Dear R community,
>
> Is there a way to get correct t-values, p-values, and R squared for linear
> regression models specified without an intercept?
>
> Example model:
> summary(lm(y ~ 0 + x))
>
> This gives p-values that are too low and an R squared that is too high. Is
> there a way to correct this, or should I specify the model with an
> intercept to get the correct values?
>
How do you know that the p-value is too low and R^2 is too high? Too low or
too high compared to what? You've constrained the fitted line to pass through
the origin, which affects several features of a simple linear regression
model. For example, sum the residuals from your no-intercept model; I'll bet
they don't add to zero. Do you think that might affect a few things? Here's
an example:

# Generate some data; notice that the true y-intercept is 2 and the
# true slope is 2
dd <- data.frame(x = 1:10, y = 2 + 2 * 1:10 + rnorm(10))
plot(y ~ x, data = dd, xlim = c(0, 10), ylim = c(0, 25))
m1 <- lm(y ~ x, data = dd)
abline(coef(m1))
m2 <- lm(y ~ x + 0, data = dd)
abline(c(0, coef(m2)), lty = 'dotted')

# As you noted, the no-intercept model has a higher R^2,
# even though the 'usual' simple linear regression (SLR)
# model provided a better visual fit. Why?
summary(m1)$r.squared
[1] 0.982328
summary(m2)$r.squared
[1] 0.9946863

# The p-value for the F-test on the slope in the
# no-intercept model is lower than in the SLR model. Why?
anova(m1)

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 385.22  385.22  444.69 2.686e-08 ***
Residuals  8   6.93    0.87
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(m2)

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 2164.07 2164.07  1684.7 1.507e-11 ***
Residuals  9   11.56    1.28


Look at the differences in sums of squares between the two models, both in
terms of model SS and error SS. What is responsible for those differences?
Once you understand that and apply the definitions, it becomes clear why the
apparent anomalies in R^2 and in the F-test occur. Also try

sum(m1$resid)
sum(m2$resid)

Why is there a difference? Why does m2$resid not have to sum to zero?

(Hint: The output in each case is correct, so it's not an R problem. You
need to derive the differences among the various quantities in regression
modeling between the intercept and no-intercept models to understand the
paradox.)
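
For instance, here is a sketch of how the two R^2 values arise from their
definitions (see ?summary.lm): with an intercept, the total sum of squares is
centered about mean(y); without one, it is the uncentered sum(y^2).

y <- dd$y
1 - sum(resid(m1)^2) / sum((y - mean(y))^2)  # reproduces summary(m1)$r.squared
1 - sum(resid(m2)^2) / sum(y^2)              # reproduces summary(m2)$r.squared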

HTH,
Dennis


Re: Correct statistical inference for linear regression models without intercept in R

Peter Dalgaard-2
In reply to this post by StatWM

On Jul 20, 2010, at 11:41 AM, StatWM wrote:

>
> Dear R community,
>
> Is there a way to get correct t-values, p-values, and R squared for linear
> regression models specified without an intercept?
>
> Example model:
> summary(lm(y ~ 0 + x))
>
> This gives p-values that are too low and an R squared that is too high. Is
> there a way to correct this, or should I specify the model with an
> intercept to get the correct values?

They are already correct. If you want incorrect ones, please specify their definition...
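
For instance, a quick simulation (just a sketch) shows that the no-intercept
test holds its nominal level when the no-intercept model is actually true:

set.seed(1)
pvals <- replicate(1000, {
  x <- rnorm(20); y <- rnorm(20)  # true slope and true intercept both zero
  summary(lm(y ~ 0 + x))$coefficients[1, 4]  # p-value for the slope
})
mean(pvals < 0.05)  # close to the nominal 0.05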



--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]


Re: Correct statistical inference for linear regression models without intercept in R

StatWM
In reply to this post by djmuseR
Thank you very much for your effort!

But is there a measure that can compare the goodness of fit of regression models with and without an intercept? Or can I only compare them in terms of the residual sum of squares?
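
For example (a sketch, reusing m1 and m2 from Dennis's example), would
comparisons like these be valid?

sum(resid(m1)^2)  # residual SS of the intercept model
sum(resid(m2)^2)  # residual SS of the no-intercept model
AIC(m1, m2)       # both models fit the same y, so AIC should be comparable
anova(m2, m1)     # F-test: the zero-intercept model is nested in the SLR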

Re: Correct statistical inference for linear regression models without intercept in R

Setlhare Lekgatlhamang
In reply to this post by djmuseR
In addition, there can be theoretical reasons for excluding the intercept
from the model, and these must be considered. The reasons relate to the
regressor(s) and depend on the phenomenon being modelled. For example,
whereas the intercept might reasonably be excluded in a bivariate model of an
individual's expenditure as determined by income, doing the same would be
senseless when applying the model to the population.
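
A hypothetical illustration with simulated data (just a sketch): if theory
says that zero income implies zero expenditure, a through-origin fit can be
defensible.

set.seed(42)
income <- runif(50, 0, 100)                      # simulated incomes
expenditure <- 0.8 * income + rnorm(50, sd = 5)  # true model has no intercept
summary(lm(expenditure ~ 0 + income))            # through-origin fit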

Hope this helps a little

Lexi
