Problem with lm.resid() when weights are provided


Problem with lm.resid() when weights are provided

hamedhm
Dear R Help Team.

I get some weird results when I use the lm() function with weights. The issue
can be reproduced by the example below.


The input data are as follows (the weights are intentionally designed to
reflect some structure in the data):


> df
           y           x       weight
 1.51156139  0.55209240 2.117337e-34
-0.63653132 -0.12599316 2.117337e-34
 0.37782776  0.42095384 4.934135e-31
 3.03792318  1.40315446 2.679495e-24
 1.53646523  0.46076858 2.679495e-24
-2.37727874 -0.73963576 6.244160e-21
 0.37183065  0.20407468 1.455107e-17
-1.53917553 -0.95519361 1.455107e-17
 1.10926675  0.03897129 3.390908e-14
-0.37786333 -0.17523593 3.390908e-14
 2.43973603  0.97970095 7.902000e-11
-0.35432394 -0.03742559 7.902000e-11
 2.19296613  1.00355263 4.289362e-04
 0.49845532  0.34816207 4.289362e-04
 1.25005260  0.76306225 5.000000e-01
 0.84360691  0.45152356 5.000000e-01
 0.29565993  0.53880068 5.000000e-01
-0.54081334 -0.28104525 5.000000e-01
 0.83612836 -0.12885659 9.995711e-01
-1.42526769 -0.87107631 9.999998e-01
 0.10204789 -0.11649899 1.000000e+00
 1.14292898  0.37249631 1.000000e+00
-3.02942081 -1.28966997 1.000000e+00
-1.37549764 -0.74676145 1.000000e+00
-2.00118016 -0.55182759 1.000000e+00
-4.24441674 -1.94603608 1.000000e+00
 1.17168144  1.00868008 1.000000e+00
 2.64007761  1.26333069 1.000000e+00
 1.98550114  1.18509599 1.000000e+00
-0.58941683 -0.61972416 9.999998e-01
-4.57559611 -2.30914920 9.995711e-01
-0.82610544 -0.39347576 9.995711e-01
-0.02768220  0.20076910 9.995711e-01
 0.78186399  0.25690215 9.995711e-01
-0.88314153 -0.20200148 5.000000e-01
-4.17076452 -2.03547588 5.000000e-01
 0.93373070  0.54190626 4.289362e-04
-0.08517734  0.17692491 4.289362e-04
-4.47546619 -2.14876688 4.289362e-04
-1.65509103 -0.76898087 4.289362e-04
-0.39403030 -0.12689705 4.289362e-04
 0.01203300 -0.18689898 1.841442e-07
-4.82762639 -2.31391121 1.841442e-07
-0.72658380 -0.39751171 3.397282e-14
-2.35886866 -1.01082109 0.000000e+00
-2.03762707 -0.96439902 0.000000e+00
 0.90115123  0.60172286 0.000000e+00
 1.55999194  0.83433953 0.000000e+00
 3.07994058  1.30942776 0.000000e+00
 1.78871462  1.10605530 0.000000e+00



Running a simple (unweighted) linear model returns:

> lm(y~x,data=df)

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x
   -0.04173      2.03790

and
> max(resid(lm(y~x,data=df)))
[1] 1.14046


*HOWEVER*, if I use the weighted model, then:

lm(formula = y ~ x, data = df, weights = df$weights)

Coefficients:
(Intercept)            x
   -0.05786      1.96087

and
> max(resid(lm(y~x,data=df,weights=df$weights)))
[1] 60.91888


As you can see, the coefficient estimates are nearly the same, but resid()
returns a giant residual (in some of my cases the value is much, much higher).
Moreover, if I calculate the residuals directly as
df$y - predict(lm(y~x,data=df,weights=df$weights)), I get the correct values.
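For what it's worth, the symptom can be reproduced on synthetic data as well (the data and the 1e-34 weight values below are my own illustration, not the df above):

```r
# Sketch: near-zero weights can make resid() disagree with y - predict().
set.seed(42)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
w <- rep(1, 30)
w[1:5] <- 1e-34                    # far below .Machine$double.eps

fit <- lm(y ~ x, weights = w)
r1  <- resid(fit)                  # residuals stored by the weighted fit
r2  <- y - predict(fit)            # residuals recomputed from the fit
max(abs(r1 - r2))                  # may be large for the tiny-weight rows
```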


Thanks.

Please do not hesitate to contact me for more details.
Regards,
Hamed.


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Problem with lm.resid() when weights are provided

Fox, John
Dear Hamed,

I don't think that anyone has picked up on this problem.

What's peculiar about your weights is that several are 0 within rounding error but not exactly 0:

> head(df)
           y          x       weight
1  1.5115614  0.5520924 2.117337e-34
2 -0.6365313 -0.1259932 2.117337e-34
3  0.3778278  0.4209538 4.934135e-31
4  3.0379232  1.4031545 2.679495e-24
5  1.5364652  0.4607686 2.679495e-24
6 -2.3772787 -0.7396358 6.244160e-21

I can reproduce the results that you report:

> (mod.1 <- lm(y ~ x, data=df))

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
   -0.04173      2.03790  

> max(resid(mod.1))
[1] 1.14046
> (mod.2 <- lm(y ~ x, data=df, weights=weight))

Call:
lm(formula = y ~ x, data = df, weights = weight)

Coefficients:
(Intercept)            x  
   -0.05786      1.96087  

> max(resid(mod.2))
[1] 36.84939

But the problem disappears when the tiny nonzero weights are set to 0:

> df2 <- df
> df2$weight <- zapsmall(df2$weight)
> head(df2)
           y          x weight
1  1.5115614  0.5520924      0
2 -0.6365313 -0.1259932      0
3  0.3778278  0.4209538      0
4  3.0379232  1.4031545      0
5  1.5364652  0.4607686      0
6 -2.3772787 -0.7396358      0
> (mod.3 <- update(mod.2, data=df2))

Call:
lm(formula = y ~ x, data = df2, weights = weight)

Coefficients:
(Intercept)            x  
   -0.05786      1.96087  

> max(resid(mod.3))
[1] 1.146663

I don't know exactly why this happens, but I suspect numerical instability produced by the near-zero weights, which are smaller than the machine double epsilon:

> .Machine$double.neg.eps
[1] 1.110223e-16

The problem also disappears if, e.g., the tiny weights are set to 1e-15 rather than 0.
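As a quick check, weights in this danger zone can be flagged by comparing them with the machine epsilon (the values below are a few taken from df's weight column):

```r
# Weights below the machine double epsilon are effectively 0 but not exactly 0.
eps <- .Machine$double.eps                    # about 2.22e-16
w   <- c(2.117337e-34, 1.455107e-17, 4.289362e-04, 0.5, 1)
w < eps                                       # TRUE for the first two values
zapsmall(w)                                   # rounds the tiny ones to exactly 0
```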

I hope this helps,
 John

-----------------------------------------------------------------
John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/





Re: Problem with lm.resid() when weights are provided

Fox, John
Dear Hamed,

When you post a question to r-help, you should generally cc subsequent messages to the list as well, as I've done with this response.

The algorithm that lm() uses is much more numerically stable than inverting the weighted sum-of-squares-and-product matrix. If you want to see how the computations are done, look at lm.wfit(), in which the residuals and fits are computed as

    z$residuals <- z$residuals/wts
    z$fitted.values <- y - z$residuals

Zero weights are handled specially, and your tiny weights are thus the source of the problem. When you divide by a number smaller than the machine double epsilon, you can't expect numerically stable results. I suppose that lm.wfit() could test for zero weights to a tolerance rather than for exact equality.
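A rough sketch of why that division hurts (the numbers below are illustrative): in lm.wfit(), wts holds the square roots of the weights, so any absolute rounding error in the weighted residual gets divided by sqrt(w), and for the tiniest weights in df that divisor is itself below the machine epsilon:

```r
# Illustrative arithmetic only: dividing by sqrt(w) amplifies rounding error.
w         <- 2.117337e-34         # one of the tiny weights in df
wts       <- sqrt(w)              # ~1.46e-17, below .Machine$double.eps
round_err <- .Machine$double.eps  # a typical absolute rounding error, ~2.2e-16
round_err / wts                   # after the division the error is order 10
```

That order-of-magnitude blow-up is consistent with a residual of ~1 coming back as ~37, as in the max(resid(mod.2)) result above.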

John

> -----Original Message-----
> From: Hamed Ha [mailto:[hidden email]]
> Sent: Friday, September 14, 2018 5:34 PM
> To: Fox, John <[hidden email]>
> Subject: Re: [R] Problem with lm.resid() when weights are provided
>
> Hi John,
>
> Thank you for your reply.
>
> I agree that the small weights are the likely source of the instability in
> the result. I also suspected a failure/bug in the actual algorithm that R
> uses for fitting the model. At some point I checked the textbook estimate of
> the parameters, solve(t(x) %*% w %*% x) %*% t(x) %*% w %*% y (although I had
> to set the tol parameter of solve() to a very small value), and found that
> lm() and the theoretical results agree; that is, the parameter estimation in
> R is correct. Moreover, I checked the predictions, predict(lm.fit), and they
> were right. The only remaining source of error was the resid() function. I
> checked it further, and it does nothing more than extract an element from an
> lm() fit. Putting this all together, I think there is something wrong (a bug
> or misconfiguration) in the lm() algorithm, and I recommend that the R Core
> Team fix it.
>
> Please feel free to contact me for more details if required.
>
> Warm regards,
> Hamed.


Re: Problem with lm.resid() when weights are provided

hamedhm
Hi John,

Thank you for your reply.

I see your point, thanks. I checked lm.wfit() and realised that it has a tol
parameter, which defaults to 1e-7; that is nowhere near machine precision.
Furthermore, playing with the tol parameter does not solve the problem, as
far as I checked.

I still consider this issue critical, and I think we should report it to the
R Core Team for further investigation. What do you think?


Regards,
Hamed.




Re: Problem with lm.resid() when weights are provided

Fox, John
Dear Hamed,

>
> Hi John,
>
> Thank you for your reply.
>
> I see your point, thanks. I checked lm.wfit() and realised that it has a tol
> parameter, which defaults to 1e-7; that is nowhere near machine precision.
> Furthermore, playing with the tol parameter does not solve the problem, as
> far as I checked.

tol plays a different role in lm.wfit(): it is passed to the QR decomposition (done in C code), I suppose to determine the rank of the weighted model matrix. Generally, in this kind of context you would use something like the square root of the machine double epsilon to define a number that is effectively 0, and the tolerance used here isn't too far off that: about an order of magnitude larger.
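For concreteness, the two tolerances compare as follows (plain arithmetic, nothing lm()-specific):

```r
# Conventional "effectively zero" threshold vs. lm.wfit()'s default tol = 1e-7.
sqrt(.Machine$double.eps)            # ~1.49e-08
1e-7 / sqrt(.Machine$double.eps)     # the default tol is roughly 7x larger
```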
 
I'm not an expert in computer arithmetic or numerical linear algebra, so I don't have anything more to say about this.

>
>
> I still consider this issue critical, and I think we should report it to the
> R Core Team for further investigation. What do you think?

I don't think that it's a critical issue because it isn't sensible to specify nonzero weights so close to 0. A simple solution is to change these weights to 0 in your code calling lm().

That said, I suppose that it might be better to make lm.wfit() more robust to near-zero weights. If you feel strongly about this, you can file a bug report, but I'm not interested in pursuing it.
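A minimal sketch of that workaround (clean_weights is a hypothetical helper name, and the cutoff is an assumption, chosen as the conventional square root of the machine epsilon):

```r
# Sketch of the suggested fix: zero out weights that are effectively 0
# before calling lm(), so lm.wfit()'s special handling of zero weights kicks in.
clean_weights <- function(w, eps = sqrt(.Machine$double.eps)) {
  w[w < eps] <- 0   # rows with weight exactly 0 are excluded from the fit
  w
}

w <- c(2.117337e-34, 0.5, 1)   # illustrative values
clean_weights(w)               # c(0, 0.5, 1)
# fit <- lm(y ~ x, data = df, weights = clean_weights(df$weight))
```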

Best,
 John

>
>
> Regards,
> Hamed.
>
>
> On Fri, 14 Sep 2018 at 22:46, Fox, John <[hidden email]
> <mailto:[hidden email]> > wrote:
>
>
> Dear Hamed,
>
> When you post a question to r-help, generally you should cc
> subsequent messages there as well, as I've done to this response.
>
> The algorithm that lm() uses is much more numerically stable than
> inverting the weighted sum-of-squares-and-product matrix. If you want to see
> how the computations are done, look at lm.wfit(), in which the residuals and
> fits are computed as
>
>    z$residuals <- z$residuals/wts
>    z$fitted.values <- y - z$residuals
>
> Zero weights are handled specially, and your tiny weights are thus the
> source of the problem. When you divide by a number less than the machine
> double-epsilon, you can't expect numerically stable results. I suppose that
> lm.wfit() could check for 0 weights to a tolerance rather than exactly.
>
> John
>
> > -----Original Message-----
> > From: Hamed Ha [mailto:[hidden email]
> <mailto:[hidden email]> ]
> > Sent: Friday, September 14, 2018 5:34 PM
> > To: Fox, John <[hidden email] <mailto:[hidden email]> >
> > Subject: Re: [R] Problem with lm.resid() when weights are provided
> >
> > Hi John,
> >
> > Thank you for your reply.
> >
> > I agree that the small weights are the potential source of the
> > instability in the result. I also suspected that there was some
> > failure/bug in the actual algorithm that R uses for fitting the
> > model. At some point I checked the theoretical estimate of the
> > parameters, solve(t(x) %*% w %*% x) %*% t(x) %*% w %*% y (besides
> > the point that I had to set the tol parameter in solve() to a super
> > small value), and realised that lm() and the theoretical results
> > match. That is, the parameter estimation is right in R. Moreover, I
> > checked the predictions, predict(lm.fit), and they were right. The
> > only remaining source of error was the resid() function. I checked
> > this function further and it does nothing more than extract a
> > sub-element from an lm() fit. Putting it all together, I think there
> > is something wrong (a bug or misconfiguration) in the lm() algorithm,
> > and I highly recommend that the R core team fix it.
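The closed-form check mentioned above can be reproduced on toy data; on a well-conditioned problem the textbook WLS estimate and lm() agree (the data below are made up for illustration, with W diagonal):

```r
set.seed(42)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
w <- runif(30, 0.5, 1)

# Textbook weighted least squares: (X'WX)^{-1} X'Wy
X <- cbind(1, x)
W <- diag(w)
beta_direct <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y

beta_lm <- coef(lm(y ~ x, weights = w))
all.equal(as.vector(beta_direct), unname(beta_lm))  # TRUE
```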
> >
> > Please feel free to contact me for more details if required.
> >
> > Warm regards,
> > Hamed.
> >
> >
> > On Fri, 14 Sep 2018 at 13:35, Fox, John <[hidden email]> wrote:
> >
> >
> >       Dear Hamed,
> >
> >       I don't think that anyone has picked up on this problem.
> >
> >       What's peculiar about your weights is that several are 0 within
> > rounding error but not exactly 0:
> >
> >       > head(df)
> >                  y          x       weight
> >       1  1.5115614  0.5520924 2.117337e-34
> >       2 -0.6365313 -0.1259932 2.117337e-34
> >       3  0.3778278  0.4209538 4.934135e-31
> >       4  3.0379232  1.4031545 2.679495e-24
> >       5  1.5364652  0.4607686 2.679495e-24
> >       6 -2.3772787 -0.7396358 6.244160e-21
> >
> >       I can reproduce the results that you report:
> >
> >       > (mod.1 <- lm(y ~ x, data=df))
> >
> >       Call:
> >       lm(formula = y ~ x, data = df)
> >
> >       Coefficients:
> >       (Intercept)            x
> >          -0.04173      2.03790
> >
> >       > max(resid(mod.1))
> >       [1] 1.14046
> >       > (mod.2 <- lm(y ~ x, data=df, weights=weight))
> >
> >       Call:
> >       lm(formula = y ~ x, data = df, weights = weight)
> >
> >       Coefficients:
> >       (Intercept)            x
> >          -0.05786      1.96087
> >
> >       > max(resid(mod.2))
> >       [1] 36.84939
> >
> >       But the problem disappears when the tiny nonzero weights are set to 0:
> >
> >       > df2 <- df
> >       > df2$weight <- zapsmall(df2$weight)
> >       > head(df2)
> >                  y          x weight
> >       1  1.5115614  0.5520924      0
> >       2 -0.6365313 -0.1259932      0
> >       3  0.3778278  0.4209538      0
> >       4  3.0379232  1.4031545      0
> >       5  1.5364652  0.4607686      0
> >       6 -2.3772787 -0.7396358      0
> >       > (mod.3 <- update(mod.2, data=df2))
> >
> >       Call:
> >       lm(formula = y ~ x, data = df2, weights = weight)
> >
> >       Coefficients:
> >       (Intercept)            x
> >          -0.05786      1.96087
> >
> >       > max(resid(mod.3))
> >       [1] 1.146663
> >
> >       I don't know exactly why this happens, but suspect numerical
> >       instability produced by the near-zero weights, which are smaller
> >       than the machine double-epsilon:
> >
> >       > .Machine$double.neg.eps
> >       [1] 1.110223e-16
> >
> >       The problem also disappears, e.g., if the tiny weights are set to
> >       1e-15 rather than 0.
> >
> >       I hope this helps,
> >        John
> >
> >       -----------------------------------------------------------------
> >       John Fox
> >       Professor Emeritus
> >       McMaster University
> >       Hamilton, Ontario, Canada
> >       Web: https://socialsciences.mcmaster.ca/jfox/
> >
> >
> >
> >       > -----Original Message-----
> >       > From: R-help [mailto:[hidden email]] On Behalf Of Hamed Ha
> >       > Sent: Tuesday, September 11, 2018 8:39 AM
> >       > To: [hidden email]
> >       > Subject: [R] Problem with lm.resid() when weights are provided
> >       >
> >       > Dear R Help Team.
> >       >
> >       > I get some weird results when I use the lm function with weight. The
> >       > issue can be reproduced by the example below:
> >       >
> >       >
> >       > The input data is (weights are intentionally designed to reflect some
> >       > structures in the data)
> >       >
> >       >
> >       > > df
> >       > y x weight
> >       >  1.51156139  0.55209240 2.117337e-34
> >       > -0.63653132 -0.12599316 2.117337e-34
> >       >  0.37782776  0.42095384 4.934135e-31
> >       >  3.03792318  1.40315446 2.679495e-24
> >       >  1.53646523  0.46076858 2.679495e-24
> >       > -2.37727874 -0.73963576 6.244160e-21
> >       >  0.37183065  0.20407468 1.455107e-17
> >       > -1.53917553 -0.95519361 1.455107e-17
> >       >  1.10926675  0.03897129 3.390908e-14
> >       > -0.37786333 -0.17523593 3.390908e-14
> >       >  2.43973603  0.97970095 7.902000e-11
> >       > -0.35432394 -0.03742559 7.902000e-11
> >       >  2.19296613  1.00355263 4.289362e-04
> >       >  0.49845532  0.34816207 4.289362e-04
> >       >  1.25005260  0.76306225 5.000000e-01
> >       >  0.84360691  0.45152356 5.000000e-01
> >       >  0.29565993  0.53880068 5.000000e-01
> >       > -0.54081334 -0.28104525 5.000000e-01
> >       >  0.83612836 -0.12885659 9.995711e-01
> >       > -1.42526769 -0.87107631 9.999998e-01
> >       >  0.10204789 -0.11649899 1.000000e+00
> >       >  1.14292898  0.37249631 1.000000e+00
> >       > -3.02942081 -1.28966997 1.000000e+00
> >       > -1.37549764 -0.74676145 1.000000e+00
> >       > -2.00118016 -0.55182759 1.000000e+00
> >       > -4.24441674 -1.94603608 1.000000e+00
> >       >  1.17168144  1.00868008 1.000000e+00
> >       >  2.64007761  1.26333069 1.000000e+00
> >       >  1.98550114  1.18509599 1.000000e+00
> >       > -0.58941683 -0.61972416 9.999998e-01
> >       > -4.57559611 -2.30914920 9.995711e-01
> >       > -0.82610544 -0.39347576 9.995711e-01
> >       > -0.02768220  0.20076910 9.995711e-01
> >       >  0.78186399  0.25690215 9.995711e-01
> >       > -0.88314153 -0.20200148 5.000000e-01
> >       > -4.17076452 -2.03547588 5.000000e-01
> >       >  0.93373070  0.54190626 4.289362e-04
> >       > -0.08517734  0.17692491 4.289362e-04
> >       > -4.47546619 -2.14876688 4.289362e-04
> >       > -1.65509103 -0.76898087 4.289362e-04
> >       > -0.39403030 -0.12689705 4.289362e-04
> >       >  0.01203300 -0.18689898 1.841442e-07
> >       > -4.82762639 -2.31391121 1.841442e-07
> >       > -0.72658380 -0.39751171 3.397282e-14
> >       > -2.35886866 -1.01082109 0.000000e+00
> >       > -2.03762707 -0.96439902 0.000000e+00
> >       >  0.90115123  0.60172286 0.000000e+00
> >       >  1.55999194  0.83433953 0.000000e+00
> >       >  3.07994058  1.30942776 0.000000e+00
> >       >  1.78871462  1.10605530 0.000000e+00
> >       >
> >       >
> >       >
> >       > Running simple linear model returns:
> >       >
> >       > > lm(y~x,data=df)
> >       >
> >       > Call:
> >       > lm(formula = y ~ x, data = df)
> >       >
> >       > Coefficients:
> >       > (Intercept)            x
> >       >    -0.04173      2.03790
> >       >
> >       > and
> >       > > max(resid(lm(y~x,data=df)))
> >       > [1] 1.14046
> >       >
> >       >
> >       > *HOWEVER if I use the weighted model then:*
> >       >
> >       > lm(formula = y ~ x, data = df, weights = df$weights)
> >       >
> >       > Coefficients:
> >       > (Intercept)            x
> >       >    -0.05786      1.96087
> >       >
> >       > and
> >       > > max(resid(lm(y~x,data=df,weights=df$weights)))
> >       > [1] 60.91888
> >       >
> >       >
> >       > As you see, the estimates of the coefficients are nearly the same,
> >       > but the resid() function returns a giant residual (I have some cases
> >       > where the value is much, much higher). Further, if I calculate the
> >       > residuals simply as predict(lm(y~x,data=df,weights=df$weights))-df$y
> >       > then I get the true values for the residuals.
> >       >
> >       >
> >       > Thanks.
> >       >
> >       > Please do not hesitate to contact me for more details.
> >       > Regards,
> >       > Hamed.
> >       >
> >       >       [[alternative HTML version deleted]]
> >       >
> >       > ______________________________________________
> >       > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >       > https://stat.ethz.ch/mailman/listinfo/r-help
> >       > PLEASE do read the posting guide
> >       > http://www.R-project.org/posting-guide.html
> >       > and provide commented, minimal, self-contained, reproducible code.
> >
>
>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.