split() - unexpected sorting of results

Peter Meissner-3
Hey,

I found this - for me - quite surprising and puzzling behaviour of split().


split(1:11, as.character(1:11))
split(1:11, 1:11)


When splitting by a numeric vector, everything works as expected (order of
the output == order of the input), but when splitting by a character vector,
the results are re-sorted alphabetically.


Although the help files contain some references to what happens when using
split(), I did not find any note on this - for me - rather unexpected
behaviour.


I would like it best if the order of the split results stayed the same no
matter the input (order of input == order of output).

If that is not possible, a note of caution in the help pages, and maybe an
example, would be valuable.


Best, Peter

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: split() - unexpected sorting of results

Iñaki Úcar
Hi Peter,

2017-10-20 21:33 GMT+02:00 Peter Meissner <[hidden email]>:

> Hey,
>
> I found this - for me - quite surprising and puzzling behaviour of split().
>
>
> split(1:11, as.character(1:11))
> split(1:11, 1:11)
>
>
> When splitting by a numeric vector, everything works as expected (order of
> the output == order of the input), but when splitting by a character vector,
> the results are re-sorted alphabetically.
>
>
> Although the help files contain some references to what happens when using
> split(), I did not find any note on this - for me - rather unexpected
> behaviour.

As the documentation states,

       f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
          grouping, or a list of such factors in which case their
          interaction is used for the grouping.

And, in fact,

> as.factor(1:11)
 [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 2 3 4 5 6 7 8 9 10 11

> as.factor(as.character(1:11))
 [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 10 11 2 3 4 5 6 7 8 9
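
One way to check this link between split() and as.factor() is sketched below (the names `g` and `res` are illustrative only, not from the thread): the names of the split() result should match the factor levels, in the same order.

```r
g <- as.character(1:11)          # character grouping vector, as in the original post
res <- split(1:11, g)

# split() coerces a non-factor f via as.factor(), so the list names
# come out in the order of the factor levels
levels(as.factor(g))
names(res)
stopifnot(identical(names(res), levels(as.factor(g))))
```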

Regards,
Iñaki

Re: split() - unexpected sorting of results

Peter Meissner-3
Thanks for the explanation.

Still, I think this is surprising behaviour which might be handled better.

Best, Peter

Re: split() - unexpected sorting of results

Hervé Pagès-2
Hi,

On 10/20/2017 12:53 PM, Peter Meissner wrote:
> Thanks for the explanation.
>
> Still, I think this is surprising behaviour which might be handled better.

Maybe a little surprising, but no more than:

 > x <- sample(11L)

 > sort(x)
  [1]  1  2  3  4  5  6  7  8  9 10 11

 > sort(as.character(x))
  [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

The fact that sort(), as.factor(), split() and many other things behave
consistently with respect to the underlying order of character vectors
avoids other even bigger surprises.

Also note that the underlying order of character vectors actually
depends on your locale. One way to guarantee consistent results across
platforms/locales is by explicitly specifying the levels when making
a factor e.g.

   f <- factor(x, levels=unique(x))
   split(1:11, f)

This is particularly sensible when writing unit tests.
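
Applied to the character case from the original post, the same idea might look like the sketch below (`g`, `by_default` and `by_appearance` are illustrative names). Setting the levels with unique() keeps the groups in order of first appearance, independent of the locale.

```r
g <- as.character(1:11)

# default: levels are the sorted distinct strings
by_default <- split(1:11, g)

# explicit levels: groups follow the order of first appearance in g
by_appearance <- split(1:11, factor(g, levels = unique(g)))

names(by_default)
names(by_appearance)

# with explicit levels the output order follows the input order
stopifnot(identical(names(by_appearance), g))
```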

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Re: split() - unexpected sorting of results

Rui Barradas
Hello,

To sort numbers stored as character strings in numeric order, the stringr
package provides the functions str_sort and str_order.

library(stringr)

set.seed(2447)

x <- sample(11L)
sort(as.character(x))
#[1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

str_sort(as.character(x), numeric = TRUE)
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

str_order(as.character(x), numeric = TRUE)
#[1]  1  4 11  8  6  5  3 10  9  7  2

i <- str_order(as.character(x), numeric = TRUE)
as.character(x)[i]
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"


Unfortunately this does not solve the OP's question: factor(),
as.factor(), split() and others use the base R sorter by default, and that
default can only be changed by changing their sources.
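
That said, the level order can still be controlled from the outside by passing explicit levels to factor(), without touching the sources. A minimal base R sketch, assuming the labels are purely numeric strings (`labels`, `lev` and `res` are illustrative names; str_sort(lev, numeric = TRUE) could presumably take the place of the order() line):

```r
set.seed(2447)
x <- sample(11L)
labels <- as.character(x)

# sort the distinct labels by their numeric value, not as strings
lev <- unique(labels)
lev <- lev[order(as.numeric(lev))]

# supply the levels explicitly so split() uses this order
res <- split(x, factor(labels, levels = lev))
names(res)
stopifnot(identical(names(res), as.character(1:11)))
```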

Hope this helps,

Rui Barradas

Re: split() - unexpected sorting of results

Peter Meissner-3
Thank you all for your input - most appreciated.

Best, Peter
