stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
25 messages Options
12
Reply | Threaded
Open this post in threaded view
|

stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

GlenB
Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or
3

Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
intercept -1 and slope 1.2

Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
length 6, with four of every six correct.

Bug has been present across many versions.

The machine I just tried it on just now has R3.2.3:

               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          2.3
year           2015
month          12
day            10
svn rev        69752
language       R
version.string R version 3.2.3 (2015-12-10)
nickname       Wooden Christmas-Tree

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Joris FA Meys
Can confirm this in R 3.4.0 :

end <- 6:100
res <- lapply(end, function(i) line(1:i,1:i))
absresid <- sapply(res, function(i) mean(abs(resid(i))))
plot(absresid, type = "h")
coefs <- sapply(res, coef)
plot(coefs[1,], coefs[2,])

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United
Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0




On Sun, May 28, 2017 at 3:28 AM, GlenB <[hidden email]> wrote:

> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or
> 3
>
> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
> intercept -1 and slope 1.2
>
> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
> length 6, with four of every six correct.
>
> Bug has been present across many versions.
>
> The machine I just tried it on just now has R3.2.3:
>
>                _
> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status
> major          3
> minor          2.3
> year           2015
> month          12
> day            10
> svn rev        69752
> language       R
> version.string R version 3.2.3 (2015-12-10)
> nickname       Wooden Christmas-Tree
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Duncan Murdoch-2
In reply to this post by GlenB
On 27/05/2017 9:28 PM, GlenB wrote:

> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or
> 3
>
> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
> intercept -1 and slope 1.2
>
> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
> length 6, with four of every six correct.
>
> Bug has been present across many versions.
>
> The machine I just tried it on just now has R3.2.3:

If you look at the source (in src/library/stats/src/line.c), the
explanation is clear:  the x value is chosen as the 1/6 quantile
(according to a particular definition of quantile), and the y value is
chosen as the median of the y values where x is less than or equal to
the 1/3 quantile.  Those are different definitions (though I think they
would be asymptotically equivalent under pretty weak assumptions), so
it's not surprising the x value doesn't correspond perfectly to the y
value, and the line ends up "wrong".

So is it a bug?  Well, that depends on Tukey's definition.  I don't have
a copy of his book handy so I can't really say.  Maybe the R function is
doing exactly what Tukey said it should, and that's not a bug.  Or maybe
R is wrong.

Duncan Murdoch

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

GlenB
Tukey divides the points into three groups, not the x and y values
separately.

I'll try to get hold of the book for a direct quote, might take a couple of
days.



On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
wrote:

> On 27/05/2017 9:28 PM, GlenB wrote:
>
>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>> or
>> 3
>>
>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>> intercept -1 and slope 1.2
>>
>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
>> length 6, with four of every six correct.
>>
>> Bug has been present across many versions.
>>
>> The machine I just tried it on just now has R3.2.3:
>>
>
> If you look at the source (in src/library/stats/src/line.c), the
> explanation is clear:  the x value is chosen as the 1/6 quantile (according
> to a particular definition of quantile), and the y value is chosen as the
> median of the y values where x is less than or equal to the 1/3 quantile.
> Those are different definitions (though I think they would be
> asymptotically equivalent under pretty weak assumptions), so it's not
> surprising the x value doesn't correspond perfectly to the y value, and the
> line ends up "wrong".
>
> So is it a bug?  Well, that depends on Tukey's definition.  I don't have a
> copy of his book handy so I can't really say.  Maybe the R function is
> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
> is wrong.
>
> Duncan Murdoch
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

GlenB
> Tukey divides the points into three groups, not the x and y values
separately.

> I'll try to get hold of the book for a direct quote, might take a couple
of days.

Ah well, I can't get it for a week. But the fact that it's often called
Tukey's three group line (try a search on *tukey three group line* and
you'll get plenty of hits) is pretty much a giveaway.


On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:

> Tukey divides the points into three groups, not the x and y values
> separately.
>
> I'll try to get hold of the book for a direct quote, might take a couple
> of days.
>
>
>
> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
> wrote:
>
>> On 27/05/2017 9:28 PM, GlenB wrote:
>>
>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>> or
>>> 3
>>>
>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>> intercept -1 and slope 1.2
>>>
>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>> of
>>> length 6, with four of every six correct.
>>>
>>> Bug has been present across many versions.
>>>
>>> The machine I just tried it on just now has R3.2.3:
>>>
>>
>> If you look at the source (in src/library/stats/src/line.c), the
>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>> to a particular definition of quantile), and the y value is chosen as the
>> median of the y values where x is less than or equal to the 1/3 quantile.
>> Those are different definitions (though I think they would be
>> asymptotically equivalent under pretty weak assumptions), so it's not
>> surprising the x value doesn't correspond perfectly to the y value, and the
>> line ends up "wrong".
>>
>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>> a copy of his book handy so I can't really say.  Maybe the R function is
>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>> is wrong.
>>
>> Duncan Murdoch
>>
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Peter Dalgaard-2
A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:

https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line

This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().

Notice, incidentally, that

> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))

Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))

Coefficients:
[1]  -0.9407   1.1948

I.e., it is not likely an issue with exact integers or perfect fit.

-pd



> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote:
>
>> Tukey divides the points into three groups, not the x and y values
> separately.
>
>> I'll try to get hold of the book for a direct quote, might take a couple
> of days.
>
> Ah well, I can't get it for a week. But the fact that it's often called
> Tukey's three group line (try a search on *tukey three group line* and
> you'll get plenty of hits) is pretty much a giveaway.
>
>
> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:
>
>> Tukey divides the points into three groups, not the x and y values
>> separately.
>>
>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
>>
>>
>>
>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
>> wrote:
>>
>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>
>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>>> or
>>>> 3
>>>>
>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>> intercept -1 and slope 1.2
>>>>
>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>> of
>>>> length 6, with four of every six correct.
>>>>
>>>> Bug has been present across many versions.
>>>>
>>>> The machine I just tried it on just now has R3.2.3:
>>>>
>>>
>>> If you look at the source (in src/library/stats/src/line.c), the
>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>> to a particular definition of quantile), and the y value is chosen as the
>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>> Those are different definitions (though I think they would be
>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>> line ends up "wrong".
>>>
>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>> is wrong.
>>>
>>> Duncan Murdoch
>>>
>>>
>>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
The problem or actual R implementation relies on an assumption
that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
which reveals not to be true despite very trustful appearance.

If we continue with the example of x=y=1:9
then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
while median(y[i] | x[i] <= quantile(x, 1/3))=2
On the other sample's side we've got 7.5 and 8 for x and y respectively.
Hence the slope (8-2)/(7.5-2.5)=1.2

To get a correct version of this, one should calculate x robust points in the same way as the y's,
i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3))

Best,
Serguei.

Le 29/05/2017 à 10:02, peter dalgaard a écrit :

> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:
>
> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
>
> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().
>
> Notice, incidentally, that
>
>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
> Call:
> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
>
> Coefficients:
> [1]  -0.9407   1.1948
>
> I.e., it is not likely an issue with exact integers or perfect fit.
>
> -pd
>
>
>
>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote:
>>
>>> Tukey divides the points into three groups, not the x and y values
>> separately.
>>
>>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
>>
>> Ah well, I can't get it for a week. But the fact that it's often called
>> Tukey's three group line (try a search on *tukey three group line* and
>> you'll get plenty of hits) is pretty much a giveaway.
>>
>>
>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:
>>
>>> Tukey divides the points into three groups, not the x and y values
>>> separately.
>>>
>>> I'll try to get hold of the book for a direct quote, might take a couple
>>> of days.
>>>
>>>
>>>
>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
>>> wrote:
>>>
>>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>>
>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>>>> or
>>>>> 3
>>>>>
>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>>> intercept -1 and slope 1.2
>>>>>
>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>>> of
>>>>> length 6, with four of every six correct.
>>>>>
>>>>> Bug has been present across many versions.
>>>>>
>>>>> The machine I just tried it on just now has R3.2.3:
>>>>>
>>>> If you look at the source (in src/library/stats/src/line.c), the
>>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>>> to a particular definition of quantile), and the y value is chosen as the
>>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>>> Those are different definitions (though I think they would be
>>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>>> line ends up "wrong".
>>>>
>>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>>> is wrong.
>>>>
>>>> Duncan Murdoch
>>>>
>>>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel


--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: [hidden email]
http://metasys.insa-toulouse.fr
http://www.lisbp.fr

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
Here is an attached patch.

Best,
Serguei.

Le 29/05/2017 à 12:21, Serguei Sokol a écrit :

> The problem or actual R implementation relies on an assumption
> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
> which reveals not to be true despite very trustful appearance.
>
> If we continue with the example of x=y=1:9
> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
> while median(y[i] | x[i] <= quantile(x, 1/3))=2
> On the other sample's side we've got 7.5 and 8 for x and y respectively.
> Hence the slope (8-2)/(7.5-2.5)=1.2
>
> To get a correct version of this, one should calculate x robust points in the same way as the y's,
> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3))
>
> Best,
> Serguei.
>
> Le 29/05/2017 à 10:02, peter dalgaard a écrit :
>> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:
>>
>> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
>>
>> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().
>>
>> Notice, incidentally, that
>>
>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
>> Call:
>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
>>
>> Coefficients:
>> [1]  -0.9407   1.1948
>>
>> I.e., it is not likely an issue with exact integers or perfect fit.
>>
>> -pd
>>
>>
>>
>>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote:
>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>> separately.
>>>
>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>> of days.
>>>
>>> Ah well, I can't get it for a week. But the fact that it's often called
>>> Tukey's three group line (try a search on *tukey three group line* and
>>> you'll get plenty of hits) is pretty much a giveaway.
>>>
>>>
>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:
>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>>> separately.
>>>>
>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>>> of days.
>>>>
>>>>
>>>>
>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
>>>> wrote:
>>>>
>>>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>>>
>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>>>>> or
>>>>>> 3
>>>>>>
>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>>>> intercept -1 and slope 1.2
>>>>>>
>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>>>> of
>>>>>> length 6, with four of every six correct.
>>>>>>
>>>>>> Bug has been present across many versions.
>>>>>>
>>>>>> The machine I just tried it on just now has R3.2.3:
>>>>>>
>>>>> If you look at the source (in src/library/stats/src/line.c), the
>>>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>>>> to a particular definition of quantile), and the y value is chosen as the
>>>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>>>> Those are different definitions (though I think they would be
>>>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>>>> line ends up "wrong".
>>>>>
>>>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>>>> is wrong.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: [hidden email]
http://metasys.insa-toulouse.fr
http://www.lisbp.fr


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

line.c.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
Sorry, I have seen it too late that we had different tab width in the original file and my editor.
Here is the patch with all white spaces instead of mixing tabs and white spaces.

Serguei.

Le 29/05/2017 à 15:13, Serguei Sokol a écrit :

> Here is an attached patch.
>
> Best,
> Serguei.
>
> Le 29/05/2017 à 12:21, Serguei Sokol a écrit :
>> The problem or actual R implementation relies on an assumption
>> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
>> which reveals not to be true despite very trustful appearance.
>>
>> If we continue with the example of x=y=1:9
>> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
>> while median(y[i] | x[i] <= quantile(x, 1/3))=2
>> On the other sample's side we've got 7.5 and 8 for x and y respectively.
>> Hence the slope (8-2)/(7.5-2.5)=1.2
>>
>> To get a correct version of this, one should calculate x robust points in the same way as the y's,
>> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3))
>>
>> Best,
>> Serguei.
>>
>> Le 29/05/2017 à 10:02, peter dalgaard a écrit :
>>> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:
>>>
>>> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
>>>
>>> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().
>>>
>>> Notice, incidentally, that
>>>
>>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
>>> Call:
>>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
>>>
>>> Coefficients:
>>> [1]  -0.9407   1.1948
>>>
>>> I.e., it is not likely an issue with exact integers or perfect fit.
>>>
>>> -pd
>>>
>>>
>>>
>>>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote:
>>>>
>>>>> Tukey divides the points into three groups, not the x and y values
>>>> separately.
>>>>
>>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>>> of days.
>>>>
>>>> Ah well, I can't get it for a week. But the fact that it's often called
>>>> Tukey's three group line (try a search on *tukey three group line* and
>>>> you'll get plenty of hits) is pretty much a giveaway.
>>>>
>>>>
>>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:
>>>>
>>>>> Tukey divides the points into three groups, not the x and y values
>>>>> separately.
>>>>>
>>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>>>> of days.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>>>>
>>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>>>>>> or
>>>>>>> 3
>>>>>>>
>>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>>>>> intercept -1 and slope 1.2
>>>>>>>
>>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>>>>> of
>>>>>>> length 6, with four of every six correct.
>>>>>>>
>>>>>>> Bug has been present across many versions.
>>>>>>>
>>>>>>> The machine I just tried it on just now has R3.2.3:
>>>>>>>
>>>>>> If you look at the source (in src/library/stats/src/line.c), the
>>>>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>>>>> to a particular definition of quantile), and the y value is chosen as the
>>>>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>>>>> Those are different definitions (though I think they would be
>>>>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>>>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>>>>> line ends up "wrong".
>>>>>>
>>>>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>>>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>>>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>>>>> is wrong.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>
>>>>     [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: [hidden email]
http://metasys.insa-toulouse.fr
http://www.lisbp.fr


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

line.c.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

GlenB
In reply to this post by Serguei Sokol
Incidentally (though I don't expect anyone will want to pursue it)
Johnstone & Velleman give standard errors for the estimates in Johnstone,
Iain M., and Paul F. Velleman. “The Resistant Line and Related Regression
Methods.” Journal of the American Statistical Association, vol. 80, no.
392, 1985, pp. 1041–1054.

On Mon, May 29, 2017 at 11:13 PM, Serguei Sokol <[hidden email]>
wrote:

> Here is an attached patch.
>
> Best,
> Serguei.
>
>
> Le 29/05/2017 à 12:21, Serguei Sokol a écrit :
>
>> The problem or actual R implementation relies on an assumption
>> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
>> which reveals not to be true despite very trustful appearance.
>>
>> If we continue with the example of x=y=1:9
>> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not
>> R's one)
>> while median(y[i] | x[i] <= quantile(x, 1/3))=2
>> On the other sample's side we've got 7.5 and 8 for x and y respectively.
>> Hence the slope (8-2)/(7.5-2.5)=1.2
>>
>> To get a correct version of this, one should calculate x robust points in
>> the same way as the y's,
>> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i]
>> >= quantile(x, 2/3))
>>
>> Best,
>> Serguei.
>>
>> Le 29/05/2017 à 10:02, peter dalgaard a écrit :
>>
>>> A usually trustworthy R correspondent posted a pure R implementation on
>>> SO at some point in his lost youth:
>>>
>>> https://stackoverflow.com/questions/3224731/john-tukey-media
>>> n-median-or-resistant-line-statistical-test-for-r-and-line
>>>
>>> This one does indeed generate the line of identity for the (1:9, 1:9)
>>> case, so I do suspect that we have a genuine scr*wup in line().
>>>
>>> Notice, incidentally, that
>>>
>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
>>>>
>>> Call:
>>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
>>>
>>> Coefficients:
>>> [1]  -0.9407   1.1948
>>>
>>> I.e., it is not likely an issue with exact integers or perfect fit.
>>>
>>> -pd
>>>
>>>
>>>
>>> On 29 May 2017, at 07:21 , GlenB <[hidden email]> wrote:
>>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>>>>
>>>> separately.
>>>>
>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>>>>
>>>> of days.
>>>>
>>>> Ah well, I can't get it for a week. But the fact that it's often called
>>>> Tukey's three group line (try a search on *tukey three group line* and
>>>> you'll get plenty of hits) is pretty much a giveaway.
>>>>
>>>>
>>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <[hidden email]> wrote:
>>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>>>> separately.
>>>>>
>>>>> I'll try to get hold of the book for a direct quote, might take a
>>>>> couple
>>>>> of days.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <
>>>>> [hidden email]>
>>>>> wrote:
>>>>>
>>>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>>>>
>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6
>>>>>>> is 2
>>>>>>> or
>>>>>>> 3
>>>>>>>
>>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it
>>>>>>> gives
>>>>>>> intercept -1 and slope 1.2
>>>>>>>
>>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a
>>>>>>> cycle
>>>>>>> of
>>>>>>> length 6, with four of every six correct.
>>>>>>>
>>>>>>> Bug has been present across many versions.
>>>>>>>
>>>>>>> The machine I just tried it on just now has R3.2.3:
>>>>>>>
>>>>>>> If you look at the source (in src/library/stats/src/line.c), the
>>>>>> explanation is clear:  the x value is chosen as the 1/6 quantile
>>>>>> (according
>>>>>> to a particular definition of quantile), and the y value is chosen as
>>>>>> the
>>>>>> median of the y values where x is less than or equal to the 1/3
>>>>>> quantile.
>>>>>> Those are different definitions (though I think they would be
>>>>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>>>>> surprising the x value doesn't correspond perfectly to the y value,
>>>>>> and the
>>>>>> line ends up "wrong".
>>>>>>
>>>>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't
>>>>>> have
>>>>>> a copy of his book handy so I can't really say.  Maybe the R function
>>>>>> is
>>>>>> doing exactly what Tukey said it should, and that's not a bug.  Or
>>>>>> maybe R
>>>>>> is wrong.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>
>>>>>>     [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>
>>
> --
> Serguei Sokol
> Ingenieur de recherche INRA
> Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)
>
> LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
> 135 Avenue de Rangueil
> 31077 Toulouse Cedex 04
>
> tel: +33 5 6155 9276
> fax: +33 5 6704 8825
> email: [hidden email]
> http://metasys.insa-toulouse.fr
> http://www.lisbp.fr
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Martin Maechler
In reply to this post by Serguei Sokol
>>>>> Serguei Sokol <[hidden email]>
>>>>>     on Mon, 29 May 2017 15:28:12 +0200 writes:

    > Sorry, I have seen it too late that we had different tab
    > width in the original file and my editor.  Here is the
    > patch with all white spaces instead of mixing tabs and
    > white spaces.

thank you - it still gives quite a few unnecessary (i.e. "white space")
 diffs with the source code, but that's not the problem :

The patch leads to correct results for the simple
 (1:k, 1:k)  data sets (for all k).

However, even after the patch,
The example from the SO post differs from the result of Richie
Cotton's function...(even though that function had a silly bug
         in step 1, the bug has not been "kicking" for the example):

Here's a fixed-up version of the pure R function and
the example and some comments :


## From Stackoverflow
## http://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
## median_median_line by Richie Cotton (July 12 2010, last edited at 13:49)
##
## Shorter variable names, fixed bug in step 1, added 'plot.' option: Martin Maechler, May 2017
MMline <- function(x, y, data, plot. = FALSE)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data)
    y <- eval(substitute(y), data)
  }
  stopifnot((n <- length(x)) == length(y), n >= 2)

  ## Step 1
  n.3 <- n %/% 3L
  groups <- rep(1:3, times = switch((n %% 3) + 1L,
     n.3,
     c(n.3, n.3 + 1L, n.3),
     c(n.3 + 1L, n.3, n.3 + 1L)
  ))
  groups <- lapply(list(groups), as.factor) # (gain a bit in tapply())

  ## Step 2
  ## sort *both*  x & y "along x":
  x <- sort.int(x, index.return = TRUE)
  y <- y[x$ix]
  x <- x$x
  if(plot.) plot(x,y)

  ## Step 3
  m_x <- tapply(x, groups, median)
  m_y <- tapply(y, groups, median)
  if(plot.) {
      points(m_x, m_y, cex=2, pch=3, col="red")
      segments(m_x[1],m_y[1], m_x[3],m_y[3], col="red")
  }
  ## Step 4
  R <- if(n == 2) 2L else 3L
  slope <- (m_y[[R]] - m_y[[1]]) / (m_x[[R]] - m_x[[1]])
  intercept <- m_y[[1]] - slope * m_x[[1]]

  ## Step 5
  mid_y <- intercept + slope * m_x[[2]]
  intercept <- intercept + (m_y[[2]] - mid_y) / 3
  if(plot.) abline(a = intercept, b = slope, col=adjustcolor("midnight blue", .5), lwd=2)
  c(intercept = intercept, slope = slope)
}

## To test it, here's the second example from that page:

dfr <- data.frame(
      time = c( .16,  .24,  .25,  .30,  .30,    .32,  .36,  .36,  .50,  .50,
                .57,  .61,  .61,  .68,  .72,    .72,  .83,  .88,  .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2,   55.8, 63.5, 65.1,124.6,129.7,
              150.2,182.2,189.4,220.4,250.4,  261.0,334.5,375.5,399.1))

MMline(time, distance, dfr, plot.=TRUE)
## intercept    slope
##   -113.6     520.0
par(new=TRUE)# should plot identically!
MMline(time, distance, dfr[sample.int(nrow(dfr)), ], plot.=TRUE)

## Note the slightly odd way of specifying the groups. The instructions are
## quite picky about how you define group sizes, so the more obvious method
## of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.

## edited Jul 12 '10 at 13:49 / answered Jul 12 '10 at 13:36
## Richie Cotton

## And someone remarked  that R's  line() did not give the same:

with(dfr, line(time, distance))
## ...
## Coefficients:
## [1]  -108.8   505.2
abline(-108.8, 505.2, col="blue") ##=> this one is wrong

## MM:
(cfs <- t(sapply(setNames(,2:50), function(k) {x <- 1:k; MMline(x,x)})))
##    intercept slope
## 2          0     1 # (special case fixed)
## 3          0     1
## 4          0     1
## 5          0     1
##   ............
## 49         0     1
## 50         0     1

## "Harder": (sort ing!)
cf2 <- t(sapply(setNames(,2:50),
                function(k) {x <- sample.int(k); MMline(x, -10*x)}))
stopifnot(abs(cf2[,"intercept"] -   0) < 1e-12,
          abs(cf2[,    "slope"] - -10) < 1e-12)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
Le 30/05/2017 à 09:33, Martin Maechler a écrit :
...
> However, even after the patch,
> The example from the SO post differs from the result of Richie
> Cotton's function...
The explanation is quite simple. In SO function, the first 1/3 quantile
of used example counts 6 points (of 19 in total), while line()'s
definition of quantile leads to 8 points. The same numbers
(6 and 8) are on the other end of sample. In x sample, there are
few repeated values, this is certainly be the reason of different quantiles..

I am not sure that one quantile definition is better or more correct
than the other. So I would leave line()'s definition as is.

Best,
Sergueï.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Martin Maechler
>>>>> Serguei Sokol <[hidden email]>
>>>>>     on Tue, 30 May 2017 16:01:17 +0200 writes:

    > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ...
    >> However, even after the patch, The example from the SO
    >> post differs from the result of Richie Cotton's
    >> function...
    > The explanation is quite simple. In SO function, the first
    > 1/3 quantile of used example counts 6 points (of 19 in
    > total), while line()'s definition of quantile leads to 8
    > points. The same numbers (6 and 8) are on the other end of
    > sample.

so the number of obs. for the three thirds for line() are
   {8, 3, 8}  in line()  [also, after your patch, right?]

whereas in MMline() they are as they should be, namely

   {6, 7, 6}

But the  {8, 3, 8}  split is not at all what all "the literature",
including Tukey himself says that "should" be done.
(Other literature on the topic suggests that the optimal sizes
 of the split in three groups depends on the distribution of x ..)

OTOH, MMline() does exactly what "the literature" and also  the
reference on the  ?line  help pages says.

    > In x sample, there are few repeated values, this
    > is certainly be the reason of different quantiles..

    > I am not sure that one quantile definition is better or
    > more correct than the other.

    > So I would leave line()'s definition as is.

you mean  _after_ applying your patch, I assume.

I currently tend do disagree. If we change line() we should
rather fix more ..
Note the 'Subject' you've chosen for this thread,
 "... does not produce the correct Tukey line",
so I think we should get better.

Apart from Richie / my  MMline() function, I've also noticed
that   ACSWR :: resistant_line()
exists.

However "the literature" (see references below), notably the two
with Hoaglin, strongly  recommends smarter iterations, and
-- lo and behold! -- when this topic came up last (for me) in
Dec. 2014, I did spend about 2 days work (or more?) to get the
FORTRAN code from the 1981 - book (which is abbreviated the
"ABC of EDA") from a somewhat useful OCR scan into compilable
Fortran code and then f2c'ed, wrote an R interface function
found problems i.e., bugs, including infinite loops, fixed most
AFAICS, but somehow did not finish making the result available.

Yes, and I have too many other things on my desk... this will
have to wait!

References:

     Tukey, J. W. (1977).  _Exploratory Data Analysis_, Reading
     Massachusetts: Addison-Wesley.

     Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics
     and Computing of Exploratory Data Analysis_ Duxbury Press.

     Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y
     versus x.  Chapter 5 of _Understanding Robust and Exploratory Data
     Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W.
     Tukey.  Wiley.

     Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line
     and Related Regression Methods.  _Journal of the American
     Statistical Association_ *80*, 1041-1054.  <URL:
     https://dx.doi.org/10.1080/01621459.1985.10478222>


    > Best, Sergueï.

Martin Maechler, ETH Zurich (and R core team)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

GlenB
Martin Maechler says in reply to Sergueï Sokol

> Note the 'Subject' you've chosen for this thread,
 "... does not produce the correct Tukey line",

The choice of title was mine not Serguei's; I posted the original message
where the error was pointed out

I agree with Martin's assessment that the correct split  (both by Tukey's
lights and by general practice)
for 19 points would be 6,7,6 and I also agree that it's better to "fix
more" in this instance, where possible.
(e.g. Johnstone&Velleman's standard errors would be a nice thing to add if
feasible) --
but if any blame is attached to the choice of title, it  really should be
aimed at me.

Glen

On Wed, May 31, 2017 at 2:51 AM, Martin Maechler <[hidden email]
> wrote:

> >>>>> Serguei Sokol <[hidden email]>
> >>>>>     on Tue, 30 May 2017 16:01:17 +0200 writes:
>
>     > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ...
>     >> However, even after the patch, The example from the SO
>     >> post differs from the result of Richie Cotton's
>     >> function...
>     > The explanation is quite simple. In SO function, the first
>     > 1/3 quantile of used example counts 6 points (of 19 in
>     > total), while line()'s definition of quantile leads to 8
>     > points. The same numbers (6 and 8) are on the other end of
>     > sample.
>
> so the number of obs. for the three thirds for line() are
>    {8, 3, 8}  in line()  [also, after your patch, right?]
>
> whereas in MMline() they are as they should be, namely
>
>    {6, 7, 6}
>
> But the  {8, 3, 8}  split is not at all what all "the literature",
> including Tukey himself says that "should" be done.
> (Other literature on the topic suggests that the optimal sizes
>  of the split in three groups depends on the distribution of x ..)
>
> OTOH, MMline() does exactly what "the literature" and also  the
> reference on the  ?line  help pages says.
>
>     > In x sample, there are few repeated values, this
>     > is certainly be the reason of different quantiles..
>
>     > I am not sure that one quantile definition is better or
>     > more correct than the other.
>
>     > So I would leave line()'s definition as is.
>
> you mean  _after_ applying your patch, I assume.
>
> I currently tend do disagree. If we change line() we should
> rather fix more ..
> Note the 'Subject' you've chosen for this thread,
>  "... does not produce the correct Tukey line",
> so I think we should get better.
>
> Apart from Richie / my  MMline() function, I've also noticed
> that   ACSWR :: resistant_line()
> exists.
>
> However "the literature" (see references below), notably the two
> with Hoaglin, strongly  recommends smarter iterations, and
> -- lo and behold! -- when this topic came up last (for me) in
> Dec. 2014, I did spend about 2 days work (or more?) to get the
> FORTRAN code from the 1981 - book (which is abbreviated the
> "ABC of EDA") from a somewhat useful OCR scan into compilable
> Fortran code and then f2c'ed, wrote an R interface function
> found problems i.e., bugs, including infinite loops, fixed most
> AFAICS, but somehow did not finish making the result available.
>
> Yes, and I have too many other things on my desk... this will
> have to wait!
>
> References:
>
>      Tukey, J. W. (1977).  _Exploratory Data Analysis_, Reading
>      Massachusetts: Addison-Wesley.
>
>      Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics
>      and Computing of Exploratory Data Analysis_ Duxbury Press.
>
>      Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y
>      versus x.  Chapter 5 of _Understanding Robust and Exploratory Data
>      Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W.
>      Tukey.  Wiley.
>
>      Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line
>      and Related Regression Methods.  _Journal of the American
>      Statistical Association_ *80*, 1041-1054.  <URL:
>      https://dx.doi.org/10.1080/01621459.1985.10478222>
>
>
>     > Best, Sergueï.
>
> Martin Maechler, ETH Zurich (and R core team)
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
In reply to this post by Martin Maechler
Le 30/05/2017 à 18:51, Martin Maechler a écrit :

>>>>>> Serguei Sokol <[hidden email]>
>>>>>>      on Tue, 30 May 2017 16:01:17 +0200 writes:
>      > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ...
>      >> However, even after the patch, The example from the SO
>      >> post differs from the result of Richie Cotton's
>      >> function...
>      > The explanation is quite simple. In SO function, the first
>      > 1/3 quantile of used example counts 6 points (of 19 in
>      > total), while line()'s definition of quantile leads to 8
>      > points. The same numbers (6 and 8) are on the other end of
>      > sample.
>
> so the number of obs. for the three thirds for line() are
>     {8, 3, 8}  in line()  [also, after your patch, right?]
>
> whereas in MMline() they are as they should be, namely
>
>     {6, 7, 6}
>
> But the  {8, 3, 8}  split is not at all what all "the literature",
> including Tukey himself says that "should" be done.
> (Other literature on the topic suggests that the optimal sizes
>   of the split in three groups depends on the distribution of x ..)
>
> OTOH, MMline() does exactly what "the literature" and also  the
> reference on the  ?line  help pages says.
Well, what I have seen so far in "literature" was mention of 1/3 quantiles
(but, yes I could overlook smth as I did not spend too much time on it)
So the sample distribution in three groups boils down to a particular quantile
definition to use. It turns out that the line()'s version (you are right, _after_ the patch
but my patch left this definition untouched) is consistent with the R's one.
If you do in R sum(dfr$time <= quantile(dfr$time, 1./3.)) you get 8, not 6
(and the same on the 2/3 end).
To my mind, consistency with the rest of R, namely with the quantile definition,
is an argument good enough to let the line()'s definition as is.

Serguei.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Joris FA Meys
OTOH,

> sapply(1:9, function(i){
+   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
+ })
[1] 8 8 6 6 6 6 8 6 6

Only the default (type = 7) and the first two types give the result lines()
gives now. I think there is plenty of reasons to give why any of the other
6 types might be better suited in Tukey's method.

So to my mind, chaning the definition of line() to give sensible output
that is in accordance with the theory, does not imply any inconsistency
with the quantile definition in R. At least not with 6 out of the 9
different ones ;-)

Cheers
Joris

On Wed, May 31, 2017 at 3:06 PM, Serguei Sokol <[hidden email]>
wrote:

> Le 30/05/2017 à 18:51, Martin Maechler a écrit :
>
>> Serguei Sokol <[hidden email]>
>>>>>>>      on Tue, 30 May 2017 16:01:17 +0200 writes:
>>>>>>>
>>>>>>      > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ...
>>      >> However, even after the patch, The example from the SO
>>      >> post differs from the result of Richie Cotton's
>>      >> function...
>>      > The explanation is quite simple. In SO function, the first
>>      > 1/3 quantile of used example counts 6 points (of 19 in
>>      > total), while line()'s definition of quantile leads to 8
>>      > points. The same numbers (6 and 8) are on the other end of
>>      > sample.
>>
>> so the number of obs. for the three thirds for line() are
>>     {8, 3, 8}  in line()  [also, after your patch, right?]
>>
>> whereas in MMline() they are as they should be, namely
>>
>>     {6, 7, 6}
>>
>> But the  {8, 3, 8}  split is not at all what all "the literature",
>> including Tukey himself says that "should" be done.
>> (Other literature on the topic suggests that the optimal sizes
>>   of the split in three groups depends on the distribution of x ..)
>>
>> OTOH, MMline() does exactly what "the literature" and also  the
>> reference on the  ?line  help pages says.
>>
> Well, what I have seen so far in "literature" was mention of 1/3 quantiles
> (but, yes I could overlook smth as I did not spend too much time on it)
> So the sample distribution in three groups boils down to a particular
> quantile
> definition to use. It turns out that the line()'s version (you are right,
> _after_ the patch
> but my patch left this definition untouched) is consistent with the R's
> one.
> If you do in R sum(dfr$time <= quantile(dfr$time, 1./3.)) you get 8, not 6
> (and the same on the 2/3 end).
> To my mind, consistency with the rest of R, namely with the quantile
> definition,
> is an argument good enough to let the line()'s definition as is.
>
> Serguei.
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol
Le 31/05/2017 à 15:40, Joris Meys a écrit :

> OTOH,
>
> > sapply(1:9, function(i){
> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
> + })
> [1] 8 8 6 6 6 6 8 6 6
>
> Only the default (type = 7) and the first two types give the result lines() gives now. I think there is plenty of reasons to give why any of the other 6 types
> might be better suited in Tukey's method.
>
> So to my mind, chaning the definition of line() to give sensible output that is in accordance with the theory, does not imply any inconsistency with the
> quantile definition in R. At least not with 6 out of the 9 different ones ;-)
Nice shot.
But OTOE (on the other end ;)
 > sapply(1:9, function(i){
+   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
+ })
[1] 8 8 8 8 6 6 8 6 6

Here "8" gains 5 votes against 4 for "6". There were two defector methods
that changed the point number and should be discarded. Which leaves us
with the score 3:4, still in favor of "6" but the default method should prevail
in my sens.

Serguei.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Joris FA Meys
Seriously, if a method gives a wrong result, it's wrong. line() does NOT
implement the algorithm of Tukey, even not after the patch. We're not
discussing Excel here, are we?

The method of Tukey is rather clear, and it is NOT using the default
quantile definition from the quantile function. Actually, it doesn't even
use quantiles to define the groups. It just says that the groups should be
more or less equally spaced. As the method of Tukey relies on the medians
of the subgroups, it would make sense to pick a method that is
approximately unbiased with regard to the median. That would be type 8
imho.

To get the size of the outer groups, Tukey would've been more than happy
enough with a:

> floor(length(dfr$time) / 3)
[1] 6

There you have the size of your left and right group, and now we can
discuss about which median type should be used for the robust fitting.

But I can honestly not understand why anyone in his right mind would defend
a method that is clearly wrong while not working at Microsoft's spreadsheet
department.

Cheers
Joris

On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]>
wrote:

> Le 31/05/2017 à 15:40, Joris Meys a écrit :
>
>> OTOH,
>>
>> > sapply(1:9, function(i){
>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>> + })
>> [1] 8 8 6 6 6 6 8 6 6
>>
>> Only the default (type = 7) and the first two types give the result
>> lines() gives now. I think there is plenty of reasons to give why any of
>> the other 6 types might be better suited in Tukey's method.
>>
>> So to my mind, chaning the definition of line() to give sensible output
>> that is in accordance with the theory, does not imply any inconsistency
>> with the quantile definition in R. At least not with 6 out of the 9
>> different ones ;-)
>>
> Nice shot.
> But OTOE (on the other end ;)
> > sapply(1:9, function(i){
> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
> + })
> [1] 8 8 8 8 6 6 8 6 6
>
> Here "8" gains 5 votes against 4 for "6". There were two defector methods
> that changed the point number and should be discarded. Which leaves us
> with the score 3:4, still in favor of "6" but the default method should
> prevail
> in my sens.
>
> Serguei.
>



--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Joris FA Meys
And with "equally spaced" I obviously meant "of equal size". It's getting
too hot in the office here...

On Wed, May 31, 2017 at 4:39 PM, Joris Meys <[hidden email]> wrote:

> Seriously, if a method gives a wrong result, it's wrong. line() does NOT
> implement the algorithm of Tukey, even not after the patch. We're not
> discussing Excel here, are we?
>
> The method of Tukey is rather clear, and it is NOT using the default
> quantile definition from the quantile function. Actually, it doesn't even
> use quantiles to define the groups. It just says that the groups should be
> more or less equally spaced. As the method of Tukey relies on the medians
> of the subgroups, it would make sense to pick a method that is
> approximately unbiased with regard to the median. That would be type 8
> imho.
>
> To get the size of the outer groups, Tukey would've been more than happy
> enough with a:
>
> > floor(length(dfr$time) / 3)
> [1] 6
>
> There you have the size of your left and right group, and now we can
> discuss about which median type should be used for the robust fitting.
>
> But I can honestly not understand why anyone in his right mind would
> defend a method that is clearly wrong while not working at Microsoft's
> spreadsheet department.
>
> Cheers
> Joris
>
> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]>
> wrote:
>
>> Le 31/05/2017 à 15:40, Joris Meys a écrit :
>>
>>> OTOH,
>>>
>>> > sapply(1:9, function(i){
>>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>>> + })
>>> [1] 8 8 6 6 6 6 8 6 6
>>>
>>> Only the default (type = 7) and the first two types give the result
>>> lines() gives now. I think there is plenty of reasons to give why any of
>>> the other 6 types might be better suited in Tukey's method.
>>>
>>> So to my mind, chaning the definition of line() to give sensible output
>>> that is in accordance with the theory, does not imply any inconsistency
>>> with the quantile definition in R. At least not with 6 out of the 9
>>> different ones ;-)
>>>
>> Nice shot.
>> But OTOE (on the other end ;)
>> > sapply(1:9, function(i){
>> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>> + })
>> [1] 8 8 8 8 6 6 8 6 6
>>
>> Here "8" gains 5 votes against 4 for "6". There were two defector methods
>> that changed the point number and should be discarded. Which leaves us
>> with the score 3:4, still in favor of "6" but the default method should
>> prevail
>> in my sens.
>>
>> Serguei.
>>
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79 <+32%209%20264%2061%2079>
> [hidden email]
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>



--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Peter Dalgaard-2

> On 31 May 2017, at 16:40 , Joris Meys <[hidden email]> wrote:
>
> And with "equally spaced" I obviously meant "of equal size". It's getting
> too hot in the office here...

We have a fair amount of cool westerly wind up here that I could transfer to you via  WWTP (Wind and Weather Transport Protocol). If you open up a sufficiently large pipe, that is.

Anyways, in the past we have tried to follow Tukey's instructions on details like the definition of the "hinges" on boxplots, so presumably we should try and do likewise for this case.

I suspect that Tukey would say "divide the data into three roughly equal-sized groups" or some such. The obvious thing to do would be to allocate N %/% 3 to each group and then the N %% 3 remaining symmetrically and as evenly as possible, which in my book would rather be (1,0,1) than (0, 2, 0) for the case N %% 3 == 2. If  N %% 3 == 1, there is no alternative to (0, 1, 0) by this logic.

>
> On Wed, May 31, 2017 at 4:39 PM, Joris Meys <[hidden email]> wrote:
>
>> Seriously, if a method gives a wrong result, it's wrong. line() does NOT
>> implement the algorithm of Tukey, even not after the patch. We're not
>> discussing Excel here, are we?
>>
>> The method of Tukey is rather clear, and it is NOT using the default
>> quantile definition from the quantile function. Actually, it doesn't even
>> use quantiles to define the groups. It just says that the groups should be
>> more or less equally spaced. As the method of Tukey relies on the medians
>> of the subgroups, it would make sense to pick a method that is
>> approximately unbiased with regard to the median. That would be type 8
>> imho.
>>
>> To get the size of the outer groups, Tukey would've been more than happy
>> enough with a:
>>
>>> floor(length(dfr$time) / 3)
>> [1] 6
>>
>> There you have the size of your left and right group, and now we can
>> discuss about which median type should be used for the robust fitting.
>>
>> But I can honestly not understand why anyone in his right mind would
>> defend a method that is clearly wrong while not working at Microsoft's
>> spreadsheet department.
>>
>> Cheers
>> Joris
>>
>> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <[hidden email]>
>> wrote:
>>
>>> Le 31/05/2017 à 15:40, Joris Meys a écrit :
>>>
>>>> OTOH,
>>>>
>>>>> sapply(1:9, function(i){
>>>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>>>> + })
>>>> [1] 8 8 6 6 6 6 8 6 6
>>>>
>>>> Only the default (type = 7) and the first two types give the result
>>>> lines() gives now. I think there is plenty of reasons to give why any of
>>>> the other 6 types might be better suited in Tukey's method.
>>>>
>>>> So to my mind, chaning the definition of line() to give sensible output
>>>> that is in accordance with the theory, does not imply any inconsistency
>>>> with the quantile definition in R. At least not with 6 out of the 9
>>>> different ones ;-)
>>>>
>>> Nice shot.
>>> But OTOE (on the other end ;)
>>>> sapply(1:9, function(i){
>>> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>>> + })
>>> [1] 8 8 8 8 6 6 8 6 6
>>>
>>> Here "8" gains 5 votes against 4 for "6". There were two defector methods
>>> that changed the point number and should be discarded. Which leaves us
>>> with the score 3:4, still in favor of "6" but the default method should
>>> prevail
>>> in my sens.
>>>
>>> Serguei.
>>>
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Mathematical Modelling, Statistics and Bio-Informatics
>>
>> tel :  +32 (0)9 264 61 79 <+32%209%20264%2061%2079>
>> [hidden email]
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> [hidden email]
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
12