From: R-devel [mailto:[hidden email]] On Behalf Of Arie ten Cate
Sent: Sunday, 08 October, 2017 14:55
To: [hidden email]
Subject: [Rd] Discourage the weights= option of lm with summarized data

Indeed: Using 'weights' is not meant to indicate that the same observation is repeated 'n' times. As I showed, this gives erroneous results. Hence I suggested that it is discouraged rather than encouraged in the Details section of lm in the Reference manual.

Arie

-----Original Message-----
On Sat, 7 Oct 2017, [hidden email] wrote:

Using 'weights' is not meant to indicate that the same observation is repeated 'n' times. It is meant to indicate different variances (or to be precise, that the variance of the last observation in 'x' is sigma^2 / n, while the first three observations have variance sigma^2).

Best,
Wolfgang

-----Original Message-----
From: R-devel [mailto:[hidden email]] On Behalf Of Arie ten Cate
Sent: Saturday, 07 October, 2017 9:36
To: [hidden email]
Subject: [Rd] Discourage the weights= option of lm with summarized data

In the Details section of lm (linear models) in the Reference manual, it is suggested to use the weights= option for summarized data. This must be discouraged rather than encouraged. The motivation for this is as follows.

With summarized data the standard errors get smaller with increasing numbers of observations. However, the standard errors in lm do not get smaller when for instance all weights are multiplied with the same constant larger than one, since the inverse weights are merely proportional to the error variances.

Here is an example of the estimated standard errors being too large with the weights= option. The p value and the number of degrees of freedom are also wrong. The parameter estimates are correct.

n <- 10
x <- c(1,2,3,4)
y <- c(1,2,5,4)
w <- c(1,1,1,n)
xb <- c(x,rep(x[4],n-1)) # restore the original data
yb <- c(y,rep(y[4],n-1))
print(summary(lm(yb ~ xb)))
print(summary(lm(y ~ x, weights=w)))

Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a FREQ statement (for summarized data).

Arie

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

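The two claims in the original post (the coefficients agree, the standard errors do not, and rescaling all weights changes nothing) can be checked with a short sketch. The names fit_w, fit_full, fit_w10, df_freq and se_adj are illustrative and not part of the thread; the rescaling by the ratio of residual degrees of freedom is an assumption that holds only when the weights really are replication counts.

```r
n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)

# Weighted fit on the summarized data, and unweighted fit on the expanded data.
fit_w    <- lm(y ~ x, weights = w)
fit_full <- lm(y ~ x, data = data.frame(x = rep(x, w), y = rep(y, w)))

se_w    <- coef(summary(fit_w))[, "Std. Error"]
se_full <- coef(summary(fit_full))[, "Std. Error"]

# Same coefficients, different residual degrees of freedom (2 versus 11).
all.equal(coef(fit_w), coef(fit_full))
c(weighted = df.residual(fit_w), expanded = df.residual(fit_full))

# Multiplying every weight by a constant leaves the standard errors unchanged.
fit_w10 <- lm(y ~ x, weights = 10 * w)
all.equal(coef(summary(fit_w10))[, "Std. Error"], se_w)

# Assuming the weights are frequencies, the residual variance should be divided
# by sum(w) - p instead of length(y) - p, which rescales the standard errors.
df_freq <- sum(w) - length(coef(fit_w))
se_adj  <- se_w * sqrt(df.residual(fit_w) / df_freq)
all.equal(se_adj, se_full)
```

The weighted residual sum of squares and the matrix X'WX are the same in both fits, so only the divisor of the residual variance differs; this is what a frequency option such as the SAS FREQ statement mentioned above accounts for.
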
On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP) <[hidden email]> wrote:

Ah, I think you are referring to this part from ?lm:

"(including the case that there are w_i observations equal to y_i and the data have been summarized)"

I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.

Best,
Wolfgang

On 9 Oct 2017, at 07:58, Arie ten Cate <[hidden email]> wrote:

Yes. Thank you; I should have quoted it.
I suggest removing this text or adding the word "not" at the beginning.

Arie

On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <[hidden email]> wrote:

AFAIR, it is a little more subtle than that.

If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.

It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like

y <- c(0,1)
w <- c(49,51)
glm(y~1, weights=w, family=binomial)

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]

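A minimal sketch of the glm() point above, comparing the weighted fit with a fit on the expanded 0/1 data. The names fit_w, fit_full, est_w and est_full are illustrative only. The explanation in the comments (the binomial dispersion is fixed at 1, so no residual variance is estimated from the residual degrees of freedom) is my reading of why the standard errors agree here, not a statement from the thread.

```r
y <- c(0, 1)
w <- c(49, 51)

# Weighted fit on the two summarized rows.
fit_w <- glm(y ~ 1, weights = w, family = binomial)

# Unweighted fit on the 100 expanded observations.
yb <- rep(y, w)
fit_full <- glm(yb ~ 1, family = binomial)

est_w    <- coef(summary(fit_w))[1, c("Estimate", "Std. Error")]
est_full <- coef(summary(fit_full))[1, c("Estimate", "Std. Error")]

# Estimate and standard error agree, although the residual df do not.
all.equal(est_w, est_full, tolerance = 1e-6)
c(weighted = df.residual(fit_w), expanded = df.residual(fit_full))

# The dispersion is not estimated for the binomial family.
summary(fit_w)$dispersion
```

In lm() the standard errors depend on a residual variance that is divided by the residual degrees of freedom, which is where replication weights go wrong; in the binomial case that step does not occur.
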
On Thu, Oct 12, 2017 at 1:48 PM, Arie ten Cate <[hidden email]> wrote:

OK. We now have three suggestions to repair the text:
- remove the text
- add "not" at the beginning of the text
- add at the end of the text a warning; something like:

"Note that in this case the standard errors of the parameter estimates are in general not correct, and hence also the t values and the p value. Also the number of degrees of freedom is not correct. (The parameter values are correct.)"

A remark about the glm example: the Reference manual says: "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes ....". Hence in the binomial case the weights are frequencies.
With y <- 0.51 and w <- 100 you get the same result.

Arie

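The remark that y <- 0.51 with w <- 100 gives the same result can be sketched as follows. The third encoding, a two-column successes/failures response, is an addition of mine for comparison and is not mentioned in the thread; the names d01, dprop, s, f and get_est are illustrative.

```r
# 51 successes in 100 trials, encoded in three ways.
d01   <- data.frame(y = c(0, 1), w = c(49, 51))   # 0/1 responses with frequencies
dprop <- data.frame(y = 0.51, w = 100)            # proportion with number of trials
s <- 51                                           # successes
f <- 49                                           # failures

fit_01   <- glm(y ~ 1, weights = w, family = binomial, data = d01)
fit_prop <- glm(y ~ 1, weights = w, family = binomial, data = dprop)
fit_cnt  <- glm(cbind(s, f) ~ 1, family = binomial)

# Extract the intercept estimate and its standard error.
get_est <- function(fit) coef(summary(fit))[1, c("Estimate", "Std. Error")]

all.equal(get_est(fit_01), get_est(fit_prop), tolerance = 1e-6)
all.equal(get_est(fit_01), get_est(fit_cnt),  tolerance = 1e-6)
```

All three describe the same binomial likelihood, which is consistent with reading the prior weights as numbers of trials in this family.
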
On 28 Nov 2017, at 09:29, Arie ten Cate <[hidden email]> wrote:

Since the three posters agree (only) that there is a bug, I propose to file it as a bug, which is the least we can do now.

There is more to it: the only other case of a change in the Reference Manual which I know of is also about the weights option! This is in coxph. The Reference Manual version 3.0.0 (2013) says about coxph:

" ... If weights is a vector of integers, then the estimated coefficients are equivalent to estimating the model from data with the individual cases replicated as many times as indicated by weights."

This is not true, as can be seen from the following code, which uses the data from the first example in the Reference Manual of coxph:

library(survival)
print(df1 <- as.data.frame(list(
    time=c(4,3,1,1,2,2,3),
    status=c(1,1,1,0,1,1,0),
    x=c(0,2,1,1,1,0,0),
    sex=c(0,0,0,0,1,1,1)
)))
print(w <- rep(2,7))
print(coxph(Surv(time,status) ~ x + strata(sex),data=df1,weights=w))
# manually doubling the data:
print(df2 <- rbind(df1,df1))
print(coxph(Surv(time,status) ~ x + strata(sex), data=df2))

This should not come as a surprise, since with coxph the computation of the likelihood (given the parameters) for a single observation also uses the other observations.

This bug has been repaired. The present Reference Manual of coxph says that the weights option specifies a vector of case weights, to which is added only: "For a thorough discussion of these see the book by Therneau and Grambsch."

Let us repair the other bug also.

Arie

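One way to look further into the coxph discrepancy is to repeat the comparison under both tie-handling methods. This is a hedged sketch, not something stated in the thread: it assumes that the difference comes from the default Efron approximation, because doubling the data turns every event into a tie, and that under the Breslow approximation the weighted and the replicated partial likelihoods coincide. It uses the `ties` argument of coxph from the survival package; the object names are illustrative.

```r
library(survival)

df1 <- data.frame(
  time   = c(4, 3, 1, 1, 2, 2, 3),
  status = c(1, 1, 1, 0, 1, 1, 0),
  x      = c(0, 2, 1, 1, 1, 0, 0),
  sex    = c(0, 0, 0, 0, 1, 1, 1)
)
df2 <- rbind(df1, df1)      # every case replicated twice
w   <- rep(2, nrow(df1))    # the corresponding case weights

# Default Efron approximation for ties.
w_efron <- coxph(Surv(time, status) ~ x + strata(sex), data = df1,
                 weights = w, ties = "efron")
r_efron <- coxph(Surv(time, status) ~ x + strata(sex), data = df2,
                 ties = "efron")

# Breslow approximation for ties.
w_bres <- coxph(Surv(time, status) ~ x + strata(sex), data = df1,
                weights = w, ties = "breslow")
r_bres <- coxph(Surv(time, status) ~ x + strata(sex), data = df2,
                ties = "breslow")

# Coefficients: weighted versus replicated, for each method.
rbind(efron   = c(weighted = unname(coef(w_efron)), replicated = unname(coef(r_efron))),
      breslow = c(weighted = unname(coef(w_bres)),  replicated = unname(coef(r_bres))))
```

If the assumption is right, the two columns agree in the breslow row and differ in the efron row.
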
It's on my todo list (for R-devel, it is not _that_ important), other things just keep taking priority...

-pd

In reply to this post by Arie ten Cate
My local R-devel version now has (in ?lm)
   Non-‘NULL’ ‘weights’ can be used to indicate that different
   observations have different variances (with the values in
   ‘weights’ being inversely proportional to the variances); or
   equivalently, when the elements of ‘weights’ are positive integers
   w_i, that each response y_i is the mean of w_i unit-weight
   observations (including the case that there are w_i observations
   equal to y_i and the data have been summarized). However, in the
   latter case, notice that within-group variation is not used.
   Therefore, the sigma estimate and residual degrees of freedom may
   be suboptimal; in the case of replication weights, even wrong.
   Hence, standard errors and analysis of variance tables should be
   treated with care.

OK?

-pd

> On 12 Oct 2017, at 13:48 , Arie ten Cate <[hidden email]> wrote:
>
> OK. We have now three suggestions to repair the text:
> - remove the text
> - add "not" at the beginning of the text
> - add at the end of the text a warning; something like:
>
>  "Note that in this case the standard estimates of the parameters are
> in general not correct, and hence also the t values and the p value.
> Also the number of degrees of freedom is not correct. (The parameter
> values are correct.)"
>
> A remark about the glm example: the Reference manual says: "For a
> binomial GLM prior weights are used to give the number of trials when
> the response is the proportion of successes ....". Hence in the
> binomial case the weights are frequencies.
> With y <- 0.51 and w <- 100 you get the same result.
>
>   Arie
>
> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <[hidden email]> wrote:
>> AFAIR, it is a little more subtle than that.
>>
>> If you have replication weights, then the estimates are right, it is
>> "just" that the SE from summary.lm() are wrong. Somehow, the text
>> should reflect this.
>>
>> It is of some importance when you put glm() into the mix, because you
>> can in fact get correct results from things like
>>
>> y <- c(0,1)
>> w <- c(49,51)
>> glm(y~1, weights=w, family=binomial)
>>
>> -pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]
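To make the "replication weights" case concrete, here is a minimal sketch that reuses the data from the example earlier in this thread. It assumes the weights really are counts of identical rows. Under that assumption the weighted fit and the fit on the expanded data share their coefficients and residual sum of squares, so the only discrepancy is the residual degrees of freedom, and the reported standard errors can be rescaled by hand. The names fit_w, fit_full, idx and se_rescaled are illustrative only and are not part of lm().

```r
# Data from the example earlier in this thread: the 4th observation occurs n times.
n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)

# Fit on the summarized data, passing the counts as weights.
fit_w <- lm(y ~ x, weights = w)

# Fit on the expanded data: row i is repeated w[i] times.
idx <- rep(seq_along(w), times = w)
fit_full <- lm(y[idx] ~ x[idx])

# The coefficients agree; the residual degrees of freedom do not:
# length(y) - p for the weighted fit versus sum(w) - p for the expanded fit.
p <- length(coef(fit_w))
df_w <- df.residual(fit_w)
df_full <- sum(w) - p

# Rescale the reported standard errors to the frequency interpretation.
se_w <- coef(summary(fit_w))[, "Std. Error"]
se_rescaled <- se_w * sqrt(df_w / df_full)
se_full <- coef(summary(fit_full))[, "Std. Error"]

print(rbind(weighted = se_w, rescaled = se_rescaled, expanded = se_full))
```

The rescaling is only valid when the weights are genuine frequencies; with variance weights the standard errors reported by summary.lm() are the appropriate ones.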
Peter,

This is a highly structured text. Just for the discussion, I separate
the building blocks, where (D) and (E) and (F) are new:

BEGIN OF TEXT --------------------

(A)

Non-‘NULL’ ‘weights’ can be used to indicate that different
observations have different variances (with the values in ‘weights’
being inversely proportional to the variances);

(B)

or equivalently, when the elements of ‘weights’ are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations

(C)

(including the case that there are w_i observations equal to y_i and
the data have been summarized).

(D)

However, in the latter case, notice that within-group variation is not
used. Therefore, the sigma estimate and residual degrees of freedom
may be suboptimal;

(E)

in the case of replication weights, even wrong.

(F)

Hence, standard errors and analysis of variance tables should be
treated with care.

END OF TEXT --------------------

I don't understand (D), partly because it is unclear to me whether (D)
refers to (C) or to (B)+(C):

If (D) refers only to (C), as the reader might automatically think
with the repetition of the word "case", then it is unclear to me to
what block does (E) refer.

If, on the other hand, (D) refers to (B)+(C) then (E) probably
refers to (C) and then I suggest to make this more clear by replacing
"in the case of replication weights" in (E) by "in the case of
summarized data".

I suggest to change "even wrong" in (E) into the more down-to-earth "wrong".

(For the record: I prefer something like my original explanation of
the problem with (C), instead of (D)+(E)+(F):
   "With summarized data the standard errors get smaller with
increasing numbers of observations w_i. However, when for instance all
w_i are multiplied with the same constant larger than one, the
reported standard errors do not get smaller since the w_i are defined
apart from an arbitrary positive multiplicative constant. Hence the
reported standard errors tend to be too large and the reported t
values and the reported number of significance stars too small.
Obviously, also the reported number of observations and the reported
number of degrees of freedom are too small."
   Note that with heteroskedasticity, _the_ residual standard error
has no meaning.)

Finally, about the original text: (B) and (C) mention only y_i, not
x_i, while this is about entire observations. Maybe this can be
remedied also?

   Arie
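The scale argument in the message above can be checked directly. The following is a small sketch, again on the data from the first example in this thread, which fits the same model with weights w and with 100 * w. The helper se() and the factor 100 are arbitrary choices made for illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, 10)

# Helper (illustrative): extract the reported standard errors from a fit.
se <- function(fit) coef(summary(fit))[, "Std. Error"]

fit_1 <- lm(y ~ x, weights = w)
fit_100 <- lm(y ~ x, weights = 100 * w)

# lm() treats the weights as known only up to a positive constant, so the
# reported standard errors and residual degrees of freedom are unchanged.
print(rbind("w" = se(fit_1), "100 * w" = se(fit_100)))
print(c(df.residual(fit_1), df.residual(fit_100)))
```

If the weights were counts, multiplying them by 100 would mean 100 times as many observations and correspondingly smaller standard errors, which is the discrepancy the message describes.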
> On 3 Dec 2017, at 16:31 , Arie ten Cate <[hidden email]> wrote:
>
> Peter,
>
> This is a highly structured text. Just for the discussion, I separate
> the building blocks, where (D) and (E) and (F) are new:
>
> BEGIN OF TEXT --------------------
>
> (A)
>
> Non-‘NULL’ ‘weights’ can be used to indicate that different
> observations have different variances (with the values in ‘weights’
> being inversely proportional to the variances);
>
> (B)
>
> or equivalently, when the elements of ‘weights’ are positive integers
> w_i, that each response y_i is the mean of w_i unit-weight
> observations
>
> (C)
>
> (including the case that there are w_i observations equal to y_i and
> the data have been summarized).
>
> (D)
>
> However, in the latter case, notice that within-group variation is not
> used. Therefore, the sigma estimate and residual degrees of freedom
> may be suboptimal;
>
> (E)
>
> in the case of replication weights, even wrong.
>
> (F)
>
> Hence, standard errors and analysis of variance tables should be
> treated with care.
>
> END OF TEXT --------------------
>
> I don't understand (D), partly because it is unclear to me whether (D)
> refers to (C) or to (B)+(C):

B, including C, is "the latter case".

> If (D) refers only to (C), as the reader might automatically think
> with the repetition of the word "case", then it is unclear to me to
> what block does (E) refer.

Not so. If it did, it should go inside the parentheses.

> If, on the other hand, (D) refers to (B)+(C) then (E) probably
> refers to (C) and then I suggest to make this more clear by replacing
> "in the case of replication weights" in (E) by "in the case of
> summarized data".
>

That would be wrong. Data can be summarized by means of groups (and SDs, which are unused, hence the suboptimality), _including_ the case where all elements are identical.

> I suggest to change "even wrong" in (E) into the more down-to-earth "wrong".

That would seem to be a matter of taste. However, "equivalently" in (B) does not look right.

--
Peter Dalgaard
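The distinction drawn in the reply above, between data summarized as group means and data that are exact replicates, can be illustrated with a small sketch. The numbers below are made up purely for illustration, and the names x_full, y_full, x_grp, y_bar, fit_full and fit_means are hypothetical. Each distinct x value has several y values that are not identical, so the within-group spread is real information that the weighted fit on the means never sees.

```r
# Made-up data: four x values with 2, 3, 2 and 4 observations each.
x_full <- rep(c(1, 2, 3, 4), times = c(2, 3, 2, 4))
y_full <- c(1.2, 0.8,  2.5, 1.9, 2.2,  4.6, 5.1,  3.8, 4.4, 4.1, 3.9)

# Summarize to one row per distinct x: group mean and group size.
grp_mean <- tapply(y_full, x_full, mean)
grp_size <- tapply(y_full, x_full, length)
x_grp <- as.numeric(names(grp_mean))
y_bar <- as.vector(grp_mean)
w <- as.vector(grp_size)

fit_full <- lm(y_full ~ x_full)              # uses every observation
fit_means <- lm(y_bar ~ x_grp, weights = w)  # uses group means only

# The coefficients agree, but sigma and the residual degrees of freedom of
# the second fit rest on the scatter of the group means around the line only.
print(rbind(full = coef(fit_full), means = coef(fit_means)))
print(c(df_full = df.residual(fit_full), df_means = df.residual(fit_means)))
print(c(sigma_full = summary(fit_full)$sigma,
        sigma_means = summary(fit_means)$sigma))
```

In this setting the weighted fit on the means is still a legitimate, if less efficient, analysis, which is presumably what "suboptimal" refers to; the sketch does not cover the exact-replicate case, where the degrees of freedom are simply too small.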