In addition, there are 'theoretical' reasons for excluding the intercept
from the model that must be considered. These reasons relate to the
regressor(s) and depend on the phenomenon being modelled. For example,
whereas the intercept can be excluded in a bivariate model of an
individual's expenditure as determined by income, doing the same would
be senseless when applying the model to the population.

Hope this helps a little

Lexi

-----Original Message-----

From: [hidden email] [mailto:[hidden email]] On Behalf Of Dennis Murphy
Sent: Tuesday, July 20, 2010 12:34 PM
To: StatWM
Cc: [hidden email]
Subject: Re: [R] Correct statistical inference for linear regression models without intercept in R

Hi:

On Tue, Jul 20, 2010 at 2:41 AM, StatWM <[hidden email]> wrote:

> Dear R community,
>
> is there a way to get correct t- and p-values and R squared for linear
> regression models specified without an intercept?
>
> example model:
> summary(lm(y ~ 0 + x))
>
> This gives too low p-values and too high R squared. Is there a way to
> correct it? Or should I specify with intercept to get the correct
> values?
>

How do you know that the p-value is too low and R^2 is too high? Too low
or too high compared to what? You've constrained the fitted line to pass
through the origin, which affects several features of a simple linear
regression model. For example, sum the residuals from your no-intercept
model - I'll bet they don't add to zero. Do you think that might affect
a few things? Here's an example:

# Generate some data; notice that the true y-intercept is 2 and the true
# slope is 2
dd <- data.frame(x = 1:10, y = 2 + 2 * 1:10 + rnorm(10))

plot(y ~ x, data = dd, xlim = c(0, 10), ylim = c(0, 25))

m1 <- lm(y ~ x, data = dd)

abline(coef(m1))

m2 <- lm(y ~ x + 0, data = dd)

abline(c(0, coef(m2)), lty = 'dotted')

# As you noted, the no-intercept model has a higher R^2,
# even though the 'usual' simple linear regression (SLR)
# model provided a better visual fit. Why?

summary(m1)$r.squared

[1] 0.982328

summary(m2)$r.squared

[1] 0.9946863

# The p-value for the F-test on the slope is lower in the
# no-intercept model than in the SLR model. Why?

anova(m1)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 385.22  385.22  444.69 2.686e-08 ***
Residuals  8   6.93    0.87
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m2)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 2164.07 2164.07  1684.7 1.507e-11 ***
Residuals  9   11.56    1.28

Look at the differences in sums of squares between the two models, both

in terms of model SS and error SS. What is responsible for those

differences?

Once you understand that, applying the definitions makes it clear why
the apparent anomalies in R^2 and in the F-test occur. Also try

sum(m1$resid)

sum(m2$resid)

Why is there a difference? Why does m2$resid not have to sum to zero?
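A minimal sketch of how one might check this numerically. It simulates its own data with a fixed seed (the seed and the names fit_int and fit_origin are assumptions for illustration, so the numbers will not match the output shown earlier). With an intercept, least squares forces the residuals to sum to zero; without one, the only remaining constraint is that the residuals are orthogonal to x.

```r
set.seed(1)                      # arbitrary seed, assumed for reproducibility
x <- 1:10
y <- 2 + 2 * x + rnorm(10)       # true intercept 2, true slope 2

fit_int    <- lm(y ~ x)          # model with an intercept
fit_origin <- lm(y ~ 0 + x)      # model forced through the origin

sum(resid(fit_int))              # zero up to rounding error
sum(resid(fit_origin))           # generally not zero
sum(x * resid(fit_origin))       # zero up to rounding error
```

The last line is the single normal equation the no-intercept fit has to satisfy, which is why nothing forces its residuals to sum to zero.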

(Hint: The output in each case is correct, so it's not an R problem. You
need to derive the differences among the various quantities in
regression modeling between the intercept and no-intercept models to
understand the paradox.)
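For readers who want to see the definitions at work, here is a rough sketch that recomputes R^2 and the F statistic by hand. It uses freshly simulated data with a fixed seed (an assumption, so values differ from the output above), and the variable names are illustrative only. The key point it relies on is that summary.lm() measures total variation about mean(y) when the model has an intercept, and about zero when it does not.

```r
set.seed(1)                      # arbitrary seed, assumed for reproducibility
x <- 1:10
y <- 2 + 2 * x + rnorm(10)

fit_int    <- lm(y ~ x)
fit_origin <- lm(y ~ 0 + x)

rss_int    <- sum(resid(fit_int)^2)
rss_origin <- sum(resid(fit_origin)^2)

# With an intercept: total SS is centered at mean(y)
r2_int <- 1 - rss_int / sum((y - mean(y))^2)

# Without an intercept: total SS is uncentered, i.e. sum(y^2)
r2_origin <- 1 - rss_origin / sum(y^2)

all.equal(r2_int,    summary(fit_int)$r.squared)
all.equal(r2_origin, summary(fit_origin)$r.squared)

# F statistic for the no-intercept model: model SS is sum(fitted^2) on 1 df
f_origin <- sum(fitted(fit_origin)^2) / (rss_origin / df.residual(fit_origin))
all.equal(f_origin, anova(fit_origin)$"F value"[1])
```

Because sum(y^2) is usually much larger than sum((y - mean(y))^2), the two R^2 values answer different questions and are not directly comparable.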

HTH,

Dennis

Thank you in advance!


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.