Increasing number of observations worsen the regression model

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Increasing number of observations worsen the regression model

Raffa
I have the following code:

```

rm(list=ls())
N = 30000
xvar <- runif(N, -10, 10)
e <- rnorm(N, mean=0, sd=1)
yvar <- 1 + 2*xvar + e
plot(xvar,yvar)
lmMod <- lm(yvar~xvar)
print(summary(lmMod))
domain <- seq(min(xvar), max(xvar))    # define a vector of x values to
feed into model
lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    #
add regression line, using `predict` to generate y-values

```

I expected the coefficients to be something similar to [1,2]. Instead R
keeps throwing at me random numbers that are not statistically
significant and don't fit the model, and I have 20k observations. For
example

```

Call:
lm(formula = yvar ~ xvar)

Residuals:
     Min      1Q  Median      3Q     Max
-21.384  -8.908   1.016  10.972  23.663

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0007145  0.0670316   0.011    0.991
xvar        0.0168271  0.0116420   1.445    0.148

Residual standard error: 11.61 on 29998 degrees of freedom
Multiple R-squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462

```


The strange thing is that the code works perfectly for N=200 or N=2000.
It's only for larger N that this thing happen U(for example, N=20000). I
have tried to ask for example in CrossValidated
<https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
but the code works for them. Any help?

I am runnign R 3.6.0 on Kubuntu 19.04

Best regards

Raffaele


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Increasing number of observations worsen the regression model

Fox, John
Dear Raffaele,

Using your code, with one modification -- setting the seed for R's random number generator to make the result reproducible -- I get:

> set.seed(12345)

. . .

> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0293 -0.6732  0.0021  0.6749  4.2883

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0057713  0.0057529   174.8   <2e-16 ***
xvar        2.0000889  0.0009998  2000.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9964 on 29998 degrees of freedom
Multiple R-squared:  0.9926, Adjusted R-squared:  0.9926
F-statistic: 4.002e+06 on 1 and 29998 DF,  p-value: < 2.2e-16

which is more or less what one would expect.

My guess: you've saved your R workspace from a previous session, and it is then loaded at the start of your R session; something in the saved workspace is affecting the result, although frankly I can't think what that might be.

I hope this helps,
 John

-----------------------------------------------------------------
John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/



> -----Original Message-----
> From: R-help [mailto:[hidden email]] On Behalf Of Raffa
> Sent: Saturday, May 25, 2019 8:38 AM
> To: [hidden email]
> Subject: [R] Increasing number of observations worsen the regression model
>
> I have the following code:
>
> ```
>
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed
> into model lines(domain, predict(lmMod, newdata =
> data.frame(xvar=domain)))    # add regression line, using `predict` to generate
> y-values
>
> ```
>
> I expected the coefficients to be something similar to [1,2]. Instead R keeps
> throwing at me random numbers that are not statistically significant and don't
> fit the model, and I have 20k observations. For example
>
> ```
>
> Call:
> lm(formula = yvar ~ xvar)
>
> Residuals:
>      Min      1Q  Median      3Q     Max
> -21.384  -8.908   1.016  10.972  23.663
>
> Coefficients:
>               Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
>
> Residual standard error: 11.61 on 29998 degrees of freedom Multiple R-
> squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
>
> ```
>
>
> The strange thing is that the code works perfectly for N=200 or N=2000.
> It's only for larger N that this thing happen U(for example, N=20000). I have
> tried to ask for example in CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-
> observations-worsen-the-regression-model>
> but the code works for them. Any help?
>
> I am runnign R 3.6.0 on Kubuntu 19.04
>
> Best regards
>
> Raffaele
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Increasing number of observations worsen the regression model

Raffa
I have solved the problem. It was caused by Intel MKL. Uninstalling
Intel MKL and using OpenBLAS instead fixed the problem completely.

Notice that, using Intel MKL, I was able to reproduce the problem by
computing the regression coefficients directly from the usual formula,
which were returned completely wrong.

My guess is that some optimization of Intel MKL which is activated in
"large" matrices give completely wrong result (I don't know which
operation exactly)

Thanks,

Best,

Raffaele Mancuso

On 26/05/19 16:06, Fox, John wrote:

> Dear Raffaele,
>
> Using your code, with one modification -- setting the seed for R's random number generator to make the result reproducible -- I get:
>
>> set.seed(12345)
> . . .
>
>> lmMod <- lm(yvar~xvar)
>> print(summary(lmMod))
> Call:
> lm(formula = yvar ~ xvar)
>
> Residuals:
>      Min      1Q  Median      3Q     Max
> -4.0293 -0.6732  0.0021  0.6749  4.2883
>
> Coefficients:
>               Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.0057713  0.0057529   174.8   <2e-16 ***
> xvar        2.0000889  0.0009998  2000.4   <2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.9964 on 29998 degrees of freedom
> Multiple R-squared:  0.9926, Adjusted R-squared:  0.9926
> F-statistic: 4.002e+06 on 1 and 29998 DF,  p-value: < 2.2e-16
>
> which is more or less what one would expect.
>
> My guess: you've saved your R workspace from a previous session, and it is then loaded at the start of your R session; something in the saved workspace is affecting the result, although frankly I can't think what that might be.
>
> I hope this helps,
>   John
>
> -----------------------------------------------------------------
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/
>
>
>
>> -----Original Message-----
>> From: R-help [mailto:[hidden email]] On Behalf Of Raffa
>> Sent: Saturday, May 25, 2019 8:38 AM
>> To: [hidden email]
>> Subject: [R] Increasing number of observations worsen the regression model
>>
>> I have the following code:
>>
>> ```
>>
>> rm(list=ls())
>> N = 30000
>> xvar <- runif(N, -10, 10)
>> e <- rnorm(N, mean=0, sd=1)
>> yvar <- 1 + 2*xvar + e
>> plot(xvar,yvar)
>> lmMod <- lm(yvar~xvar)
>> print(summary(lmMod))
>> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed
>> into model lines(domain, predict(lmMod, newdata =
>> data.frame(xvar=domain)))    # add regression line, using `predict` to generate
>> y-values
>>
>> ```
>>
>> I expected the coefficients to be something similar to [1,2]. Instead R keeps
>> throwing at me random numbers that are not statistically significant and don't
>> fit the model, and I have 20k observations. For example
>>
>> ```
>>
>> Call:
>> lm(formula = yvar ~ xvar)
>>
>> Residuals:
>>       Min      1Q  Median      3Q     Max
>> -21.384  -8.908   1.016  10.972  23.663
>>
>> Coefficients:
>>                Estimate Std. Error t value Pr(>|t|)
>> (Intercept) 0.0007145  0.0670316   0.011    0.991
>> xvar        0.0168271  0.0116420   1.445    0.148
>>
>> Residual standard error: 11.61 on 29998 degrees of freedom Multiple R-
>> squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
>> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
>>
>> ```
>>
>>
>> The strange thing is that the code works perfectly for N=200 or N=2000.
>> It's only for larger N that this thing happen U(for example, N=20000). I have
>> tried to ask for example in CrossValidated
>> <https://stats.stackexchange.com/questions/410050/increasing-number-of-
>> observations-worsen-the-regression-model>
>> but the code works for them. Any help?
>>
>> I am runnign R 3.6.0 on Kubuntu 19.04
>>
>> Best regards
>>
>> Raffaele
>>
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Increasing number of observations worsen the regression model

R help mailing list-2
In reply to this post by Raffa
Raffa,

I ran this on a MacOS machine and got what you expected. I added a call to sessionInfo() for your information.

> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2407 -0.6738 -0.0031  0.6822  4.0619

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0059022  0.0057370   175.3   <2e-16 ***
xvar        2.0005811  0.0009918  2017.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9937 on 29998 degrees of freedom
Multiple R-squared:  0.9927, Adjusted R-squared:  0.9927
F-statistic: 4.069e+06 on 1 and 29998 DF,  p-value: < 2.2e-16

> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    # add regression line, using `predict` to generate y-values
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] compiler_3.6.0




R. Mark Sharp, Ph.D.
Data Scientist and Biomedical Statistical Consultant
7526 Meadow Green St.
San Antonio, TX 78251
mobile: 210-218-2868
[hidden email]











> On May 25, 2019, at 7:38 AM, Raffa <[hidden email]> wrote:
>
> I have the following code:
>
> ```
>
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to
> feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    #
> add regression line, using `predict` to generate y-values
>
> ```
>
> I expected the coefficients to be something similar to [1,2]. Instead R
> keeps throwing at me random numbers that are not statistically
> significant and don't fit the model, and I have 20k observations. For
> example
>
> ```
>
> Call:
> lm(formula = yvar ~ xvar)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -21.384  -8.908   1.016  10.972  23.663
>
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
>
> Residual standard error: 11.61 on 29998 degrees of freedom
> Multiple R-squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
>
> ```
>
>
> The strange thing is that the code works perfectly for N=200 or N=2000.
> It's only for larger N that this thing happen U(for example, N=20000). I
> have tried to ask for example in CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
> but the code works for them. Any help?
>
> I am runnign R 3.6.0 on Kubuntu 19.04
>
> Best regards
>
> Raffaele
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Increasing number of observations worsen the regression model

Ivan Krylov
In reply to this post by Raffa
On Sat, 25 May 2019 14:38:07 +0200
Raffa <[hidden email]> wrote:

> I have tried to ask for example in CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
> but the code works for them. Any help?

In the comments you note that the problem went away after you replaced
Intel MKL with OpenBLAS. This is important.

The code that fits linear models in R is somewhat complex[*]; if
you want to get to the bottom of the problem, you may have to take
parts of it and feed them differently-sized linear regression problems
until you narrow it down to a specific set of calls to BLAS or LAPACK
functions which Intel MKL provides.

One option would be to ask at Intel MKL forums[**].

--
Best regards,
Ivan

[*]
https://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html

[**] https://software.intel.com/en-us/forums/intel-math-kernel-library/

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.