Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or
3 Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives intercept -1 and slope 1.2 Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of length 6, with four of every six correct. Bug has been present across many versions. The machine I just tried it on just now has R3.2.3: _ platform x86_64-w64-mingw32 arch x86_64 os mingw32 system x86_64, mingw32 status major 3 minor 2.3 year 2015 month 12 day 10 svn rev 69752 language R version.string R version 3.2.3 (2015-12-10) nickname Wooden Christmas-Tree [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Can confirm this in R 3.4.0 :
end <- 6:100 res <- lapply(end, function(i) line(1:i,1:i)) absresid <- sapply(res, function(i) mean(abs(resid(i)))) plot(absresid, type = "h") coefs <- sapply(res, coef) plot(coefs[1,], coefs[2,]) > sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] compiler_3.4.0 tools_3.4.0 On Sun, May 28, 2017 at 3:28 AM, GlenB <[hidden email]> wrote: > Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or > 3 > > Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives > intercept -1 and slope 1.2 > > Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of > length 6, with four of every six correct. > > Bug has been present across many versions. > > The machine I just tried it on just now has R3.2.3: > > _ > platform x86_64-w64-mingw32 > arch x86_64 > os mingw32 > system x86_64, mingw32 > status > major 3 > minor 2.3 > year 2015 > month 12 > day 10 > svn rev 69752 > language R > version.string R version 3.2.3 (2015-12-10) > nickname Wooden Christmas-Tree > > [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > -- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Mathematical Modelling, Statistics and Bio-Informatics tel : +32 (0)9 264 61 79 [hidden email] ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by GlenB
On 27/05/2017 9:28 PM, GlenB wrote:
> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or > 3 > > Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives > intercept -1 and slope 1.2 > > Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of > length 6, with four of every six correct. > > Bug has been present across many versions. > > The machine I just tried it on just now has R3.2.3: If you look at the source (in src/library/stats/src/line.c), the explanation is clear: the x value is chosen as the 1/6 quantile (according to a particular definition of quantile), and the y value is chosen as the median of the y values where x is less than or equal to the 1/3 quantile. Those are different definitions (though I think they would be asymptotically equivalent under pretty weak assumptions), so it's not surprising the x value doesn't correspond perfectly to the y value, and the line ends up "wrong". So is it a bug? Well, that depends on Tukey's definition. I don't have a copy of his book handy so I can't really say. Maybe the R function is doing exactly what Tukey said it should, and that's not a bug. Or maybe R is wrong. Duncan Murdoch ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Tukey divides the points into three groups, not the x and y values
separately. I'll try to get hold of the book for a direct quote, might take a couple of days. On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> wrote: > On 27/05/2017 9:28 PM, GlenB wrote: > >> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >> or >> 3 >> >> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >> intercept -1 and slope 1.2 >> >> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of >> length 6, with four of every six correct. >> >> Bug has been present across many versions. >> >> The machine I just tried it on just now has R3.2.3: >> > > If you look at the source (in src/library/stats/src/line.c), the > explanation is clear: the x value is chosen as the 1/6 quantile (according > to a particular definition of quantile), and the y value is chosen as the > median of the y values where x is less than or equal to the 1/3 quantile. > Those are different definitions (though I think they would be > asymptotically equivalent under pretty weak assumptions), so it's not > surprising the x value doesn't correspond perfectly to the y value, and the > line ends up "wrong". > > So is it a bug? Well, that depends on Tukey's definition. I don't have a > copy of his book handy so I can't really say. Maybe the R function is > doing exactly what Tukey said it should, and that's not a bug. Or maybe R > is wrong. > > Duncan Murdoch > > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
> Tukey divides the points into three groups, not the x and y values
separately. > I'll try to get hold of the book for a direct quote, might take a couple of days. Ah well, I can't get it for a week. But the fact that it's often called Tukey's three group line (try a search on *tukey three group line* and you'll get plenty of hits) is pretty much a giveaway. On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: > Tukey divides the points into three groups, not the x and y values > separately. > > I'll try to get hold of the book for a direct quote, might take a couple > of days. > > > > On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> > wrote: > >> On 27/05/2017 9:28 PM, GlenB wrote: >> >>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>> or >>> 3 >>> >>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>> intercept -1 and slope 1.2 >>> >>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>> of >>> length 6, with four of every six correct. >>> >>> Bug has been present across many versions. >>> >>> The machine I just tried it on just now has R3.2.3: >>> >> >> If you look at the source (in src/library/stats/src/line.c), the >> explanation is clear: the x value is chosen as the 1/6 quantile (according >> to a particular definition of quantile), and the y value is chosen as the >> median of the y values where x is less than or equal to the 1/3 quantile. >> Those are different definitions (though I think they would be >> asymptotically equivalent under pretty weak assumptions), so it's not >> surprising the x value doesn't correspond perfectly to the y value, and the >> line ends up "wrong". >> >> So is it a bug? Well, that depends on Tukey's definition. I don't have >> a copy of his book handy so I can't really say. Maybe the R function is >> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >> is wrong. >> >> Duncan Murdoch >> >> > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:
https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). Notice, incidentally, that > line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) Call: line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) Coefficients: [1] -0.9407 1.1948 I.e., it is not likely an issue with exact integers or perfect fit. -pd > On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote: > >> Tukey divides the points into three groups, not the x and y values > separately. > >> I'll try to get hold of the book for a direct quote, might take a couple > of days. > > Ah well, I can't get it for a week. But the fact that it's often called > Tukey's three group line (try a search on *tukey three group line* and > you'll get plenty of hits) is pretty much a giveaway. > > > On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: > >> Tukey divides the points into three groups, not the x and y values >> separately. >> >> I'll try to get hold of the book for a direct quote, might take a couple >> of days. >> >> >> >> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> >> wrote: >> >>> On 27/05/2017 9:28 PM, GlenB wrote: >>> >>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>> or >>>> 3 >>>> >>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>> intercept -1 and slope 1.2 >>>> >>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>> of >>>> length 6, with four of every six correct. >>>> >>>> Bug has been present across many versions. >>>> >>>> The machine I just tried it on just now has R3.2.3: >>>> >>> >>> If you look at the source (in src/library/stats/src/line.c), the >>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>> to a particular definition of quantile), and the y value is chosen as the >>> median of the y values where x is less than or equal to the 1/3 quantile. >>> Those are different definitions (though I think they would be >>> asymptotically equivalent under pretty weak assumptions), so it's not >>> surprising the x value doesn't correspond perfectly to the y value, and the >>> line ends up "wrong". >>> >>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>> a copy of his book handy so I can't really say. Maybe the R function is >>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>> is wrong. >>> >>> Duncan Murdoch >>> >>> >> > > [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: [hidden email] Priv: [hidden email] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
The problem or actual R implementation relies on an assumption
that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) which reveals not to be true despite very trustful appearance. If we continue with the example of x=y=1:9 then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) while median(y[i] | x[i] <= quantile(x, 1/3))=2 On the other sample's side we've got 7.5 and 8 for x and y respectively. Hence the slope (8-2)/(7.5-2.5)=1.2 To get a correct version of this, one should calculate x robust points in the same way as the y's, i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) Best, Serguei. Le 29/05/2017 à 10:02, peter dalgaard a écrit : > A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: > > https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line > > This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). > > Notice, incidentally, that > >> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) > Call: > line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) > > Coefficients: > [1] -0.9407 1.1948 > > I.e., it is not likely an issue with exact integers or perfect fit. > > -pd > > > >> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote: >> >>> Tukey divides the points into three groups, not the x and y values >> separately. >> >>> I'll try to get hold of the book for a direct quote, might take a couple >> of days. >> >> Ah well, I can't get it for a week. But the fact that it's often called >> Tukey's three group line (try a search on *tukey three group line* and >> you'll get plenty of hits) is pretty much a giveaway. >> >> >> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: >> >>> Tukey divides the points into three groups, not the x and y values >>> separately. >>> >>> I'll try to get hold of the book for a direct quote, might take a couple >>> of days. >>> >>> >>> >>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> >>> wrote: >>> >>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>> >>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>> or >>>>> 3 >>>>> >>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>> intercept -1 and slope 1.2 >>>>> >>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>> of >>>>> length 6, with four of every six correct. >>>>> >>>>> Bug has been present across many versions. >>>>> >>>>> The machine I just tried it on just now has R3.2.3: >>>>> >>>> If you look at the source (in src/library/stats/src/line.c), the >>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>> to a particular definition of quantile), and the y value is chosen as the >>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>> Those are different definitions (though I think they would be >>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>> line ends up "wrong". >>>> >>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>> a copy of his book handy so I can't really say. Maybe the R function is >>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>> is wrong. >>>> >>>> Duncan Murdoch >>>> >>>> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> [hidden email] mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel -- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: [hidden email] http://metasys.insa-toulouse.fr http://www.lisbp.fr ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Here is an attached patch.
Best, Serguei. Le 29/05/2017 à 12:21, Serguei Sokol a écrit : > The problem or actual R implementation relies on an assumption > that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) > which reveals not to be true despite very trustful appearance. > > If we continue with the example of x=y=1:9 > then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) > while median(y[i] | x[i] <= quantile(x, 1/3))=2 > On the other sample's side we've got 7.5 and 8 for x and y respectively. > Hence the slope (8-2)/(7.5-2.5)=1.2 > > To get a correct version of this, one should calculate x robust points in the same way as the y's, > i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) > > Best, > Serguei. > > Le 29/05/2017 à 10:02, peter dalgaard a écrit : >> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: >> >> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line >> >> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). >> >> Notice, incidentally, that >> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >> Call: >> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >> >> Coefficients: >> [1] -0.9407 1.1948 >> >> I.e., it is not likely an issue with exact integers or perfect fit. >> >> -pd >> >> >> >>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>> separately. >>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>> of days. >>> >>> Ah well, I can't get it for a week. But the fact that it's often called >>> Tukey's three group line (try a search on *tukey three group line* and >>> you'll get plenty of hits) is pretty much a giveaway. >>> >>> >>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>> of days. >>>> >>>> >>>> >>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> >>>> wrote: >>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>>> or >>>>>> 3 >>>>>> >>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>>> intercept -1 and slope 1.2 >>>>>> >>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>>> of >>>>>> length 6, with four of every six correct. >>>>>> >>>>>> Bug has been present across many versions. >>>>>> >>>>>> The machine I just tried it on just now has R3.2.3: >>>>>> >>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>>> to a particular definition of quantile), and the y value is chosen as the >>>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>>> Those are different definitions (though I think they would be >>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>>> line ends up "wrong". >>>>> >>>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>>> a copy of his book handy so I can't really say. Maybe the R function is >>>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>>> is wrong. >>>>> >>>>> Duncan Murdoch >>>>> >>>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> [hidden email] mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel > > Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: [hidden email] http://metasys.insa-toulouse.fr http://www.lisbp.fr ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel line.c.patch (2K) Download Attachment |
Sorry, I have seen it too late that we had different tab width in the original file and my editor.
Here is the patch with all white spaces instead of mixing tabs and white spaces. Serguei. Le 29/05/2017 à 15:13, Serguei Sokol a écrit : > Here is an attached patch. > > Best, > Serguei. > > Le 29/05/2017 à 12:21, Serguei Sokol a écrit : >> The problem or actual R implementation relies on an assumption >> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) >> which reveals not to be true despite very trustful appearance. >> >> If we continue with the example of x=y=1:9 >> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) >> while median(y[i] | x[i] <= quantile(x, 1/3))=2 >> On the other sample's side we've got 7.5 and 8 for x and y respectively. >> Hence the slope (8-2)/(7.5-2.5)=1.2 >> >> To get a correct version of this, one should calculate x robust points in the same way as the y's, >> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) >> >> Best, >> Serguei. >> >> Le 29/05/2017 à 10:02, peter dalgaard a écrit : >>> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: >>> >>> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line >>> >>> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). >>> >>> Notice, incidentally, that >>> >>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >>> Call: >>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >>> >>> Coefficients: >>> [1] -0.9407 1.1948 >>> >>> I.e., it is not likely an issue with exact integers or perfect fit. >>> >>> -pd >>> >>> >>> >>>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote: >>>> >>>>> Tukey divides the points into three groups, not the x and y values >>>> separately. >>>> >>>>> I'll try to get hold of the book for a direct quote, might take a couple >>>> of days. >>>> >>>> Ah well, I can't get it for a week. But the fact that it's often called >>>> Tukey's three group line (try a search on *tukey three group line* and >>>> you'll get plenty of hits) is pretty much a giveaway. >>>> >>>> >>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: >>>> >>>>> Tukey divides the points into three groups, not the x and y values >>>>> separately. >>>>> >>>>> I'll try to get hold of the book for a direct quote, might take a couple >>>>> of days. >>>>> >>>>> >>>>> >>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]> >>>>> wrote: >>>>> >>>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>>> >>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>>>> or >>>>>>> 3 >>>>>>> >>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>>>> intercept -1 and slope 1.2 >>>>>>> >>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>>>> of >>>>>>> length 6, with four of every six correct. >>>>>>> >>>>>>> Bug has been present across many versions. >>>>>>> >>>>>>> The machine I just tried it on just now has R3.2.3: >>>>>>> >>>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>>>> to a particular definition of quantile), and the y value is chosen as the >>>>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>>>> Those are different definitions (though I think they would be >>>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>>>> line ends up "wrong". >>>>>> >>>>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>>>> a copy of his book handy so I can't really say. Maybe the R function is >>>>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>>>> is wrong. >>>>>> >>>>>> Duncan Murdoch >>>>>> >>>>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> [hidden email] mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: [hidden email] http://metasys.insa-toulouse.fr http://www.lisbp.fr ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel line.c.patch (3K) Download Attachment |
In reply to this post by Serguei Sokol
Incidentally (though I don't expect anyone will want to pursue it)
Johnstone & Velleman give standard errors for the estimates in Johnstone, Iain M., and Paul F. Velleman. “The Resistant Line and Related Regression Methods.” Journal of the American Statistical Association, vol. 80, no. 392, 1985, pp. 1041–1054. On Mon, May 29, 2017 at 11:13 PM, Serguei Sokol <[hidden email]> wrote: > Here is an attached patch. > > Best, > Serguei. > > > Le 29/05/2017 à 12:21, Serguei Sokol a écrit : > >> The problem or actual R implementation relies on an assumption >> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) >> which reveals not to be true despite very trustful appearance. >> >> If we continue with the example of x=y=1:9 >> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not >> R's one) >> while median(y[i] | x[i] <= quantile(x, 1/3))=2 >> On the other sample's side we've got 7.5 and 8 for x and y respectively. >> Hence the slope (8-2)/(7.5-2.5)=1.2 >> >> To get a correct version of this, one should calculate x robust points in >> the same way as the y's, >> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >> >= quantile(x, 2/3)) >> >> Best, >> Serguei. >> >> Le 29/05/2017 à 10:02, peter dalgaard a écrit : >> >>> A usually trustworthy R correspondent posted a pure R implementation on >>> SO at some point in his lost youth: >>> >>> https://stackoverflow.com/questions/3224731/john-tukey-media >>> n-median-or-resistant-line-statistical-test-for-r-and-line >>> >>> This one does indeed generate the line of identity for the (1:9, 1:9) >>> case, so I do suspect that we have a genuine scr*wup in line(). >>> >>> Notice, incidentally, that >>> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >>>> >>> Call: >>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >>> >>> Coefficients: >>> [1] -0.9407 1.1948 >>> >>> I.e., it is not likely an issue with exact integers or perfect fit. >>> >>> -pd >>> >>> >>> >>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>>> >>>> of days. >>>> >>>> Ah well, I can't get it for a week. But the fact that it's often called >>>> Tukey's three group line (try a search on *tukey three group line* and >>>> you'll get plenty of hits) is pretty much a giveaway. >>>> >>>> >>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> separately. >>>>> >>>>> I'll try to get hold of the book for a direct quote, might take a >>>>> couple >>>>> of days. >>>>> >>>>> >>>>> >>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch < >>>>> [hidden email]> >>>>> wrote: >>>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 >>>>>>> is 2 >>>>>>> or >>>>>>> 3 >>>>>>> >>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it >>>>>>> gives >>>>>>> intercept -1 and slope 1.2 >>>>>>> >>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a >>>>>>> cycle >>>>>>> of >>>>>>> length 6, with four of every six correct. >>>>>>> >>>>>>> Bug has been present across many versions. >>>>>>> >>>>>>> The machine I just tried it on just now has R3.2.3: >>>>>>> >>>>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>>> explanation is clear: the x value is chosen as the 1/6 quantile >>>>>> (according >>>>>> to a particular definition of quantile), and the y value is chosen as >>>>>> the >>>>>> median of the y values where x is less than or equal to the 1/3 >>>>>> quantile. >>>>>> Those are different definitions (though I think they would be >>>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>>> surprising the x value doesn't correspond perfectly to the y value, >>>>>> and the >>>>>> line ends up "wrong". >>>>>> >>>>>> So is it a bug? Well, that depends on Tukey's definition. I don't >>>>>> have >>>>>> a copy of his book handy so I can't really say. Maybe the R function >>>>>> is >>>>>> doing exactly what Tukey said it should, and that's not a bug. Or >>>>>> maybe R >>>>>> is wrong. >>>>>> >>>>>> Duncan Murdoch >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> [hidden email] mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >> >> > -- > Serguei Sokol > Ingenieur de recherche INRA > Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) > > LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 > 135 Avenue de Rangueil > 31077 Toulouse Cedex 04 > > tel: +33 5 6155 9276 > fax: +33 5 6704 8825 > email: [hidden email] > http://metasys.insa-toulouse.fr > http://www.lisbp.fr > > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by Serguei Sokol
>>>>> Serguei Sokol <[hidden email]>
>>>>> on Mon, 29 May 2017 15:28:12 +0200 writes: > Sorry, I have seen it too late that we had different tab > width in the original file and my editor. Here is the > patch with all white spaces instead of mixing tabs and > white spaces. thank you - it still gives quite a few unnecessary (i.e. "white space") diffs with the source code, but that's not the problem : The patch leads to correct results for the simple (1:k, 1:k) data sets (for all k). However, even after the patch, The example from the SO post differs from the result of Richie Cotton's function...(even though that function had a silly bug in step 1, the bug has not been "kicking" for the example): Here's a fixed-up version of the pure R function and the example and some comments : ## From Stackoverflow ## http://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line ## median_median_line by Richie Cotton (July 12 2010, last edited at 13:49) ## ## Shorter variable names, fixed bug in step 1, added 'plot.' option: Martin Maechler, May 2017 MMline <- function(x, y, data, plot. = FALSE) { if(!missing(data)) { x <- eval(substitute(x), data) y <- eval(substitute(y), data) } stopifnot((n <- length(x)) == length(y), n >= 2) ## Step 1 n.3 <- n %/% 3L groups <- rep(1:3, times = switch((n %% 3) + 1L, n.3, c(n.3, n.3 + 1L, n.3), c(n.3 + 1L, n.3, n.3 + 1L) )) groups <- lapply(list(groups), as.factor) # (gain a bit in tapply()) ## Step 2 ## sort *both* x & y "along x": x <- sort.int(x, index.return = TRUE) y <- y[x$ix] x <- x$x if(plot.) plot(x,y) ## Step 3 m_x <- tapply(x, groups, median) m_y <- tapply(y, groups, median) if(plot.) { points(m_x, m_y, cex=2, pch=3, col="red") segments(m_x[1],m_y[1], m_x[3],m_y[3], col="red") } ## Step 4 R <- if(n == 2) 2L else 3L slope <- (m_y[[R]] - m_y[[1]]) / (m_x[[R]] - m_x[[1]]) intercept <- m_y[[1]] - slope * m_x[[1]] ## Step 5 mid_y <- intercept + slope * m_x[[2]] intercept <- intercept + (m_y[[2]] - mid_y) / 3 if(plot.) abline(a = intercept, b = slope, col=adjustcolor("midnight blue", .5), lwd=2) c(intercept = intercept, slope = slope) } ## To test it, here's the second example from that page: dfr <- data.frame( time = c( .16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89), distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1,124.6,129.7, 150.2,182.2,189.4,220.4,250.4, 261.0,334.5,375.5,399.1)) MMline(time, distance, dfr, plot.=TRUE) ## intercept slope ## -113.6 520.0 par(new=TRUE)# should plot identically! MMline(time, distance, dfr[sample.int(nrow(dfr)), ], plot.=TRUE) ## Note the slightly odd way of specifying the groups. The instructions are ## quite picky about how you define group sizes, so the more obvious method ## of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work. ## edited Jul 12 '10 at 13:49 / answered Jul 12 '10 at 13:36 ## Richie Cotton ## And someone remarked that R's line() did not give the same: with(dfr, line(time, distance)) ## ... ## Coefficients: ## [1] -108.8 505.2 abline(-108.8, 505.2, col="blue") ##=> this one is wrong ## MM: (cfs <- t(sapply(setNames(,2:50), function(k) {x <- 1:k; MMline(x,x)}))) ## intercept slope ## 2 0 1 # (special case fixed) ## 3 0 1 ## 4 0 1 ## 5 0 1 ## ............ ## 49 0 1 ## 50 0 1 ## "Harder": (sort ing!) cf2 <- t(sapply(setNames(,2:50), function(k) {x <- sample.int(k); MMline(x, -10*x)})) stopifnot(abs(cf2[,"intercept"] - 0) < 1e-12, abs(cf2[, "slope"] - -10) < 1e-12) ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Le 30/05/2017 à 09:33, Martin Maechler a écrit :
... > However, even after the patch, > The example from the SO post differs from the result of Richie > Cotton's function... The explanation is quite simple. In SO function, the first 1/3 quantile of used example counts 6 points (of 19 in total), while line()'s definition of quantile leads to 8 points. The same numbers (6 and 8) are on the other end of sample. In x sample, there are few repeated values, this is certainly be the reason of different quantiles.. I am not sure that one quantile definition is better or more correct than the other. So I would leave line()'s definition as is. Best, Sergueï. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
>>>>> Serguei Sokol <[hidden email]>
>>>>> on Tue, 30 May 2017 16:01:17 +0200 writes: > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ... >> However, even after the patch, The example from the SO >> post differs from the result of Richie Cotton's >> function... > The explanation is quite simple. In SO function, the first > 1/3 quantile of used example counts 6 points (of 19 in > total), while line()'s definition of quantile leads to 8 > points. The same numbers (6 and 8) are on the other end of > sample. so the number of obs. for the three thirds for line() are {8, 3, 8} in line() [also, after your patch, right?] whereas in MMline() they are as they should be, namely {6, 7, 6} But the {8, 3, 8} split is not at all what all "the literature", including Tukey himself says that "should" be done. (Other literature on the topic suggests that the optimal sizes of the split in three groups depends on the distribution of x ..) OTOH, MMline() does exactly what "the literature" and also the reference on the ?line help pages says. > In x sample, there are few repeated values, this > is certainly be the reason of different quantiles.. > I am not sure that one quantile definition is better or > more correct than the other. > So I would leave line()'s definition as is. you mean _after_ applying your patch, I assume. I currently tend do disagree. If we change line() we should rather fix more .. Note the 'Subject' you've chosen for this thread, "... does not produce the correct Tukey line", so I think we should get better. Apart from Richie / my MMline() function, I've also noticed that ACSWR :: resistant_line() exists. However "the literature" (see references below), notably the two with Hoaglin, strongly recommends smarter iterations, and -- lo and behold! -- when this topic came up last (for me) in Dec. 2014, I did spend about 2 days work (or more?) to get the FORTRAN code from the 1981 - book (which is abbreviated the "ABC of EDA") from a somewhat useful OCR scan into compilable Fortran code and then f2c'ed, wrote an R interface function found problems i.e., bugs, including infinite loops, fixed most AFAICS, but somehow did not finish making the result available. Yes, and I have too many other things on my desk... this will have to wait! References: Tukey, J. W. (1977). _Exploratory Data Analysis_, Reading Massachusetts: Addison-Wesley. Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics and Computing of Exploratory Data Analysis_ Duxbury Press. Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y versus x. Chapter 5 of _Understanding Robust and Exploratory Data Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W. Tukey. Wiley. Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line and Related Regression Methods. _Journal of the American Statistical Association_ *80*, 1041-1054. <URL: https://dx.doi.org/10.1080/01621459.1985.10478222> > Best, Sergueï. Martin Maechler, ETH Zurich (and R core team) ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Martin Maechler says in reply to Sergueï Sokol
> Note the 'Subject' you've chosen for this thread, "... does not produce the correct Tukey line", The choice of title was mine not Serguei's; I posted the original message where the error was pointed out I agree with Martin's assessment that the correct split (both by Tukey's lights and by general practice) for 19 points would be 6,7,6 and I also agree that it's better to "fix more" in this instance, where possible. (e.g. Johnstone&Velleman's standard errors would be a nice thing to add if feasible) -- but if any blame is attached to the choice of title, it really should be aimed at me. Glen On Wed, May 31, 2017 at 2:51 AM, Martin Maechler <[hidden email] > wrote: > >>>>> Serguei Sokol <[hidden email]> > >>>>> on Tue, 30 May 2017 16:01:17 +0200 writes: > > > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ... > >> However, even after the patch, The example from the SO > >> post differs from the result of Richie Cotton's > >> function... > > The explanation is quite simple. In SO function, the first > > 1/3 quantile of used example counts 6 points (of 19 in > > total), while line()'s definition of quantile leads to 8 > > points. The same numbers (6 and 8) are on the other end of > > sample. > > so the number of obs. for the three thirds for line() are > {8, 3, 8} in line() [also, after your patch, right?] > > whereas in MMline() they are as they should be, namely > > {6, 7, 6} > > But the {8, 3, 8} split is not at all what all "the literature", > including Tukey himself says that "should" be done. > (Other literature on the topic suggests that the optimal sizes > of the split in three groups depends on the distribution of x ..) > > OTOH, MMline() does exactly what "the literature" and also the > reference on the ?line help pages says. > > > In x sample, there are few repeated values, this > > is certainly be the reason of different quantiles.. > > > I am not sure that one quantile definition is better or > > more correct than the other. > > > So I would leave line()'s definition as is. > > you mean _after_ applying your patch, I assume. > > I currently tend do disagree. If we change line() we should > rather fix more .. > Note the 'Subject' you've chosen for this thread, > "... does not produce the correct Tukey line", > so I think we should get better. > > Apart from Richie / my MMline() function, I've also noticed > that ACSWR :: resistant_line() > exists. > > However "the literature" (see references below), notably the two > with Hoaglin, strongly recommends smarter iterations, and > -- lo and behold! -- when this topic came up last (for me) in > Dec. 2014, I did spend about 2 days work (or more?) to get the > FORTRAN code from the 1981 - book (which is abbreviated the > "ABC of EDA") from a somewhat useful OCR scan into compilable > Fortran code and then f2c'ed, wrote an R interface function > found problems i.e., bugs, including infinite loops, fixed most > AFAICS, but somehow did not finish making the result available. > > Yes, and I have too many other things on my desk... this will > have to wait! > > References: > > Tukey, J. W. (1977). _Exploratory Data Analysis_, Reading > Massachusetts: Addison-Wesley. > > Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics > and Computing of Exploratory Data Analysis_ Duxbury Press. > > Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y > versus x. Chapter 5 of _Understanding Robust and Exploratory Data > Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W. > Tukey. Wiley. > > Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line > and Related Regression Methods. _Journal of the American > Statistical Association_ *80*, 1041-1054. <URL: > https://dx.doi.org/10.1080/01621459.1985.10478222> > > > > Best, Sergueï. > > Martin Maechler, ETH Zurich (and R core team) > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by Martin Maechler
Le 30/05/2017 à 18:51, Martin Maechler a écrit :
>>>>>> Serguei Sokol <[hidden email]> >>>>>> on Tue, 30 May 2017 16:01:17 +0200 writes: > > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ... > >> However, even after the patch, The example from the SO > >> post differs from the result of Richie Cotton's > >> function... > > The explanation is quite simple. In SO function, the first > > 1/3 quantile of used example counts 6 points (of 19 in > > total), while line()'s definition of quantile leads to 8 > > points. The same numbers (6 and 8) are on the other end of > > sample. > > so the number of obs. for the three thirds for line() are > {8, 3, 8} in line() [also, after your patch, right?] > > whereas in MMline() they are as they should be, namely > > {6, 7, 6} > > But the {8, 3, 8} split is not at all what all "the literature", > including Tukey himself says that "should" be done. > (Other literature on the topic suggests that the optimal sizes > of the split in three groups depends on the distribution of x ..) > > OTOH, MMline() does exactly what "the literature" and also the > reference on the ?line help pages says. (but, yes I could overlook smth as I did not spend too much time on it) So the sample distribution in three groups boils down to a particular quantile definition to use. It turns out that the line()'s version (you are right, _after_ the patch but my patch left this definition untouched) is consistent with the R's one. If you do in R sum(dfr$time <= quantile(dfr$time, 1./3.)) you get 8, not 6 (and the same on the 2/3 end). To my mind, consistency with the rest of R, namely with the quantile definition, is an argument good enough to let the line()'s definition as is. Serguei. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
OTOH,
> sapply(1:9, function(i){ + sum(dfr$time <= quantile(dfr$time, 1./3., type = i)) + }) [1] 8 8 6 6 6 6 8 6 6 Only the default (type = 7) and the first two types give the result lines() gives now. I think there is plenty of reasons to give why any of the other 6 types might be better suited in Tukey's method. So to my mind, chaning the definition of line() to give sensible output that is in accordance with the theory, does not imply any inconsistency with the quantile definition in R. At least not with 6 out of the 9 different ones ;-) Cheers Joris On Wed, May 31, 2017 at 3:06 PM, Serguei Sokol <[hidden email]> wrote: > Le 30/05/2017 à 18:51, Martin Maechler a écrit : > >> Serguei Sokol <[hidden email]> >>>>>>> on Tue, 30 May 2017 16:01:17 +0200 writes: >>>>>>> >>>>>> > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ... >> >> However, even after the patch, The example from the SO >> >> post differs from the result of Richie Cotton's >> >> function... >> > The explanation is quite simple. In SO function, the first >> > 1/3 quantile of used example counts 6 points (of 19 in >> > total), while line()'s definition of quantile leads to 8 >> > points. The same numbers (6 and 8) are on the other end of >> > sample. >> >> so the number of obs. for the three thirds for line() are >> {8, 3, 8} in line() [also, after your patch, right?] >> >> whereas in MMline() they are as they should be, namely >> >> {6, 7, 6} >> >> But the {8, 3, 8} split is not at all what all "the literature", >> including Tukey himself says that "should" be done. >> (Other literature on the topic suggests that the optimal sizes >> of the split in three groups depends on the distribution of x ..) >> >> OTOH, MMline() does exactly what "the literature" and also the >> reference on the ?line help pages says. >> > Well, what I have seen so far in "literature" was mention of 1/3 quantiles > (but, yes I could overlook smth as I did not spend too much time on it) > So the sample distribution in three groups boils down to a particular > quantile > definition to use. It turns out that the line()'s version (you are right, > _after_ the patch > but my patch left this definition untouched) is consistent with the R's > one. > If you do in R sum(dfr$time <= quantile(dfr$time, 1./3.)) you get 8, not 6 > (and the same on the 2/3 end). > To my mind, consistency with the rest of R, namely with the quantile > definition, > is an argument good enough to let the line()'s definition as is. > > Serguei. > > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > -- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Mathematical Modelling, Statistics and Bio-Informatics tel : +32 (0)9 264 61 79 [hidden email] ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Le 31/05/2017 à 15:40, Joris Meys a écrit :
> OTOH, > > > sapply(1:9, function(i){ > + sum(dfr$time <= quantile(dfr$time, 1./3., type = i)) > + }) > [1] 8 8 6 6 6 6 8 6 6 > > Only the default (type = 7) and the first two types give the result lines() gives now. I think there is plenty of reasons to give why any of the other 6 types > might be better suited in Tukey's method. > > So to my mind, chaning the definition of line() to give sensible output that is in accordance with the theory, does not imply any inconsistency with the > quantile definition in R. At least not with 6 out of the 9 different ones ;-) But OTOE (on the other end ;) > sapply(1:9, function(i){ + sum(dfr$time >= quantile(dfr$time, 2./3., type = i)) + }) [1] 8 8 8 8 6 6 8 6 6 Here "8" gains 5 votes against 4 for "6". There were two defector methods that changed the point number and should be discarded. Which leaves us with the score 3:4, still in favor of "6" but the default method should prevail in my sens. Serguei. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Seriously, if a method gives a wrong result, it's wrong. line() does NOT
implement the algorithm of Tukey, even not after the patch. We're not discussing Excel here, are we? The method of Tukey is rather clear, and it is NOT using the default quantile definition from the quantile function. Actually, it doesn't even use quantiles to define the groups. It just says that the groups should be more or less equally spaced. As the method of Tukey relies on the medians of the subgroups, it would make sense to pick a method that is approximately unbiased with regard to the median. That would be type 8 imho. To get the size of the outer groups, Tukey would've been more than happy enough with a: > floor(length(dfr$time) / 3) [1] 6 There you have the size of your left and right group, and now we can discuss about which median type should be used for the robust fitting. But I can honestly not understand why anyone in his right mind would defend a method that is clearly wrong while not working at Microsoft's spreadsheet department. Cheers Joris On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]> wrote: > Le 31/05/2017 à 15:40, Joris Meys a écrit : > >> OTOH, >> >> > sapply(1:9, function(i){ >> + sum(dfr$time <= quantile(dfr$time, 1./3., type = i)) >> + }) >> [1] 8 8 6 6 6 6 8 6 6 >> >> Only the default (type = 7) and the first two types give the result >> lines() gives now. I think there is plenty of reasons to give why any of >> the other 6 types might be better suited in Tukey's method. >> >> So to my mind, chaning the definition of line() to give sensible output >> that is in accordance with the theory, does not imply any inconsistency >> with the quantile definition in R. At least not with 6 out of the 9 >> different ones ;-) >> > Nice shot. > But OTOE (on the other end ;) > > sapply(1:9, function(i){ > + sum(dfr$time >= quantile(dfr$time, 2./3., type = i)) > + }) > [1] 8 8 8 8 6 6 8 6 6 > > Here "8" gains 5 votes against 4 for "6". There were two defector methods > that changed the point number and should be discarded. Which leaves us > with the score 3:4, still in favor of "6" but the default method should > prevail > in my sens. > > Serguei. > -- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Mathematical Modelling, Statistics and Bio-Informatics tel : +32 (0)9 264 61 79 [hidden email] ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
And with "equally spaced" I obviously meant "of equal size". It's getting
too hot in the office here... On Wed, May 31, 2017 at 4:39 PM, Joris Meys <[hidden email]> wrote: > Seriously, if a method gives a wrong result, it's wrong. line() does NOT > implement the algorithm of Tukey, even not after the patch. We're not > discussing Excel here, are we? > > The method of Tukey is rather clear, and it is NOT using the default > quantile definition from the quantile function. Actually, it doesn't even > use quantiles to define the groups. It just says that the groups should be > more or less equally spaced. As the method of Tukey relies on the medians > of the subgroups, it would make sense to pick a method that is > approximately unbiased with regard to the median. That would be type 8 > imho. > > To get the size of the outer groups, Tukey would've been more than happy > enough with a: > > > floor(length(dfr$time) / 3) > [1] 6 > > There you have the size of your left and right group, and now we can > discuss about which median type should be used for the robust fitting. > > But I can honestly not understand why anyone in his right mind would > defend a method that is clearly wrong while not working at Microsoft's > spreadsheet department. > > Cheers > Joris > > On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]> > wrote: > >> Le 31/05/2017 à 15:40, Joris Meys a écrit : >> >>> OTOH, >>> >>> > sapply(1:9, function(i){ >>> + sum(dfr$time <= quantile(dfr$time, 1./3., type = i)) >>> + }) >>> [1] 8 8 6 6 6 6 8 6 6 >>> >>> Only the default (type = 7) and the first two types give the result >>> lines() gives now. I think there is plenty of reasons to give why any of >>> the other 6 types might be better suited in Tukey's method. >>> >>> So to my mind, chaning the definition of line() to give sensible output >>> that is in accordance with the theory, does not imply any inconsistency >>> with the quantile definition in R. At least not with 6 out of the 9 >>> different ones ;-) >>> >> Nice shot. >> But OTOE (on the other end ;) >> > sapply(1:9, function(i){ >> + sum(dfr$time >= quantile(dfr$time, 2./3., type = i)) >> + }) >> [1] 8 8 8 8 6 6 8 6 6 >> >> Here "8" gains 5 votes against 4 for "6". There were two defector methods >> that changed the point number and should be discarded. Which leaves us >> with the score 3:4, still in favor of "6" but the default method should >> prevail >> in my sens. >> >> Serguei. >> > > > > -- > Joris Meys > Statistical consultant > > Ghent University > Faculty of Bioscience Engineering > Department of Mathematical Modelling, Statistics and Bio-Informatics > > tel : +32 (0)9 264 61 79 <+32%209%20264%2061%2079> > [hidden email] > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > -- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Mathematical Modelling, Statistics and Bio-Informatics tel : +32 (0)9 264 61 79 [hidden email] ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
> On 31 May 2017, at 16:40 , Joris Meys <[hidden email]> wrote: > > And with "equally spaced" I obviously meant "of equal size". It's getting > too hot in the office here... We have a fair amount of cool westerly wind up here that I could transfer to you via WWTP (Wind and Weather Transport Protocol). If you open up a sufficiently large pipe, that is. Anyways, in the past we have tried to follow Tukey's instructions on details like the definition of the "hinges" on boxplots, so presumably we should try and do likewise for this case. I suspect that Tukey would say "divide the data into three roughly equal-sized groups" or some such. The obvious thing to do would be to allocate N %/% 3 to each group and then the N %% 3 remaining symmetrically and as evenly as possible, which in my book would rather be (1,0,1) than (0, 2, 0) for the case N %% 3 == 2. If N %% 3 == 1, there is no alternative to (0, 1, 0) by this logic. > > On Wed, May 31, 2017 at 4:39 PM, Joris Meys <[hidden email]> wrote: > >> Seriously, if a method gives a wrong result, it's wrong. line() does NOT >> implement the algorithm of Tukey, even not after the patch. We're not >> discussing Excel here, are we? >> >> The method of Tukey is rather clear, and it is NOT using the default >> quantile definition from the quantile function. Actually, it doesn't even >> use quantiles to define the groups. It just says that the groups should be >> more or less equally spaced. As the method of Tukey relies on the medians >> of the subgroups, it would make sense to pick a method that is >> approximately unbiased with regard to the median. That would be type 8 >> imho. >> >> To get the size of the outer groups, Tukey would've been more than happy >> enough with a: >> >>> floor(length(dfr$time) / 3) >> [1] 6 >> >> There you have the size of your left and right group, and now we can >> discuss about which median type should be used for the robust fitting. >> >> But I can honestly not understand why anyone in his right mind would >> defend a method that is clearly wrong while not working at Microsoft's >> spreadsheet department. >> >> Cheers >> Joris >> >> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]> >> wrote: >> >>> Le 31/05/2017 à 15:40, Joris Meys a écrit : >>> >>>> OTOH, >>>> >>>>> sapply(1:9, function(i){ >>>> + sum(dfr$time <= quantile(dfr$time, 1./3., type = i)) >>>> + }) >>>> [1] 8 8 6 6 6 6 8 6 6 >>>> >>>> Only the default (type = 7) and the first two types give the result >>>> lines() gives now. I think there is plenty of reasons to give why any of >>>> the other 6 types might be better suited in Tukey's method. >>>> >>>> So to my mind, chaning the definition of line() to give sensible output >>>> that is in accordance with the theory, does not imply any inconsistency >>>> with the quantile definition in R. At least not with 6 out of the 9 >>>> different ones ;-) >>>> >>> Nice shot. >>> But OTOE (on the other end ;) >>>> sapply(1:9, function(i){ >>> + sum(dfr$time >= quantile(dfr$time, 2./3., type = i)) >>> + }) >>> [1] 8 8 8 8 6 6 8 6 6 >>> >>> Here "8" gains 5 votes against 4 for "6". There were two defector methods >>> that changed the point number and should be discarded. Which leaves us >>> with the score 3:4, still in favor of "6" but the default method should >>> prevail >>> in my sens. >>> >>> Serguei. >>> >> >> >> >> -- >> Joris Meys >> Statistical consultant >> >> Ghent University >> Faculty of Bioscience Engineering >> Department of Mathematical Modelling, Statistics and Bio-Informatics >> >> tel : +32 (0)9 264 61 79 <+32%209%20264%2061%2079> >> [hidden email] >> ------------------------------- >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >> > > > > -- > Joris Meys > Statistical consultant > > Ghent University > Faculty of Bioscience Engineering > Department of Mathematical Modelling, Statistics and Bio-Informatics > > tel : +32 (0)9 264 61 79 > [hidden email] > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > > [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: [hidden email] Priv: [hidden email] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Free forum by Nabble | Edit this page |