Query on R-squared correlation coefficient for linear regression through origin

Patrick Barrie
I have a query on the R-squared correlation coefficient for linear
regression through the origin.

The general expression for R-squared in regression (whether linear or
non-linear) is
R-squared = 1 - sum((y-ypredicted)^2) / sum((y-ybar)^2)

However, the lm function within R does not seem to use this expression
when the intercept is constrained to be zero. It gives results different
to Excel and other data analysis packages.

As an example (using built-in cars dataframe):
> cars.lm=lm(dist ~ 0+speed, data=cars)     # linear regression through origin
> summary(cars.lm)$r.squared     # report R-squared
[1] 0.8962893
> 1-deviance(cars.lm)/sum((cars$dist-mean(cars$dist))^2)     # calculates R-squared directly
[1] 0.6018997
> # The latter corresponds to the value reported by Excel (and other data analysis packages)
>
> # Note that we expect R-squared to be smaller for linear regression through the origin
> # than for linear regression without a constraint (which is 0.6511 in this example)

Does anyone know what R is doing in this case? Is there an option to get
R to return what I termed the "general" expression for R-squared? The
adjusted R-squared value is also affected. [Other parameters all seem
correct.]

Thanks for any help on this issue,

Patrick

P.S. I believe old versions of Excel (before 2003) also had this issue.

--
Dr Patrick J. Barrie
Department of Chemical Engineering and Biotechnology
University of Cambridge
Philippa Fawcett Drive, Cambridge CB3 0AS
01223 331864
[hidden email]



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Query on R-squared correlation coefficient for linear regression through origin

J C Nash
This issue traces back to the very unfortunate use of R-squared as the
name of a tool that simply compares a model to the model that is a single
number (the mean). The mean can be shown to be the optimal choice for a
model that is a single number, so it makes sense to try to do better.

The OP has the correct form -- and I find that, no matter what the software,
when working with models that do NOT have a constant in them (e.g., nonlinear
models, regression through the origin) it pays to do the calculation
"manually". In R it is really easy to write the necessary function, so
why take a chance that a software developer has tried to expand the concept
using a personal choice that goes beyond a clear definition?
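
A minimal sketch of such a function, applied to the through-origin fit from the original post. The name r2_vs_mean is made up for illustration and is not part of base R; it only implements the mean-baseline expression quoted by the OP.

```r
# Hypothetical helper (not in base R): R-squared relative to the
# mean-only model, i.e. 1 - RSS / sum((y - mean(y))^2).
r2_vs_mean <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

# Regression through the origin on the built-in cars data,
# as in the original post.
fit0 <- lm(dist ~ 0 + speed, data = cars)

r2_manual  <- r2_vs_mean(cars$dist, fitted(fit0))
r2_summary <- summary(fit0)$r.squared

r2_manual    # mean-baseline value
r2_summary   # value reported by summary()
```

On the cars data, r2_manual should agree with the 0.6018997 from the OP's direct calculation, while r2_summary is the 0.8962893 that prompted the question.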

I've commented elsewhere that I use this statistic even for nonlinear
models in my own software, since I think one should do better than the
mean for a model, but other workers shy away from using it for nonlinear
models because there may be false interpretation based on its use for
linear models.
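
As an illustration only, the same statistic can be computed for a nonlinear fit. The sketch below assumes the built-in Puromycin data and the self-starting Michaelis-Menten model SSmicmen, neither of which is mentioned in this thread, and r2_vs_mean is again a hypothetical helper rather than a base R function.

```r
# Hypothetical helper: compare a model's residual sum of squares
# with that of the mean-only model.
r2_vs_mean <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

# Assumed example data: the treated subset of the built-in Puromycin set.
treated <- subset(Puromycin, state == "treated")

# Nonlinear least squares with a self-starting Michaelis-Menten model.
fit_nl <- nls(rate ~ SSmicmen(conc, Vm, K), data = treated)

r2_nl <- r2_vs_mean(treated$rate, fitted(fit_nl))
r2_nl   # how much better than the mean the nonlinear model does
```

The number says only how much the model improves on the mean; it does not carry the other interpretations that R-squared has for linear models with an intercept.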

JN


On 2018-09-27 06:56 AM, Patrick Barrie wrote:

> [...]

Re: Query on R-squared correlation coefficient for linear regression through origin

Eric Berger
See also this thread on stats.stackexchange:

https://stats.stackexchange.com/questions/26176/removal-of-statistically-significant-intercept-term-increases-r2-in-linear-mo



On Thu, Sep 27, 2018 at 3:43 PM, J C Nash <[hidden email]> wrote:

> [...]

Re: Query on R-squared correlation coefficient for linear regression through origin

Peter Dalgaard-2
In reply to this post by Patrick Barrie
This is an old discussion. What R does is compare the fitted model to the model without any regressors, which in the no-intercept case is the constant zero. Otherwise, you would be comparing non-nested models, and the R^2 would not be guaranteed to lie between 0 and 1.
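
A short sketch of the two baselines, for anyone who wants to check this. The first part uses the cars fit from the original post; the second part uses toy data (x2, y2) made up here to show the mean-baseline version going negative. The zero-baseline form is the one described in ?summary.lm for models without an intercept.

```r
fit0 <- lm(dist ~ 0 + speed, data = cars)   # regression through the origin
y    <- cars$dist
rss  <- sum(residuals(fit0)^2)

r2_zero <- 1 - rss / sum(y^2)               # baseline: the constant zero
r2_mean <- 1 - rss / sum((y - mean(y))^2)   # baseline: the mean

# The zero-baseline value is what summary() reports for this model.
same_as_summary <- isTRUE(all.equal(r2_zero, summary(fit0)$r.squared))

# Toy data (made up): a decreasing line far from the origin, so a line
# through the origin fits worse than the mean does.
x2   <- 1:10
y2   <- 20 - x2
fit2 <- lm(y2 ~ 0 + x2)
r2_mean_toy <- 1 - sum(residuals(fit2)^2) / sum((y2 - mean(y2))^2)

same_as_summary    # TRUE
r2_mean_toy < 0    # TRUE: not confined to [0, 1]
```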

A similar issue affects anova tables, where the regression sum of squares is sum(yhat^2) rather than sum((yhat - ybar)^2).
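
The anova point can be checked the same way. This is only a sketch on the cars fit from the original post; the variable names are chosen here for illustration.

```r
fit0 <- lm(dist ~ 0 + speed, data = cars)   # regression through the origin
tab  <- anova(fit0)
yhat <- fitted(fit0)

reg_ss       <- tab["speed", "Sum Sq"]          # regression sum of squares
ss_uncentred <- sum(yhat^2)                     # about zero
ss_centred   <- sum((yhat - mean(cars$dist))^2) # about the mean of y

# For the no-intercept model the table uses the uncentred version.
matches_uncentred <- isTRUE(all.equal(reg_ss, ss_uncentred))
matches_centred   <- isTRUE(all.equal(reg_ss, ss_centred))

matches_uncentred
matches_centred
```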

-pd

> On 27 Sep 2018, at 12:56 , Patrick Barrie <[hidden email]> wrote:
> [...]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

Re: Query on R-squared correlation coefficient for linear regression through origin

Rui Barradas
In reply to this post by Patrick Barrie
Hello,

As for R^2 in Excel for models without an intercept, maybe the following
are relevant.

https://support.microsoft.com/en-us/help/829249/you-will-receive-an-incorrect-r-squared-value-in-the-chart-tool-in-exc

https://stat.ethz.ch/pipermail/r-help/2012-July/318347.html


Hope this helps,

Rui Barradas

At 11:56 on 27/09/2018, Patrick Barrie wrote:

> [...]