Faster sorting algorithm...

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Faster sorting algorithm...

MorganMorgan
Hi,
I am not sure if this is the right mailing list, so apologies in advance if
it is not.

I found the following link/presentation:
https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf

The implementation of fsort is interesting but incomplete (not sure why?)
and can be improved or made faster (at least 25%  I believe). I might be
wrong but there are maybe a couple of bugs as well.

My questions are:

1/ Is the R Core team interested in a faster sorting algo? (Multithread or
even single threaded)

2/ I see an issue with the license, which is MPL-2.0, and hence not
compatible with base R, Python and Julia. Is there an interest to change
the license of fsort so all 3 languages (and all the people using these
languages) can benefit from it? (Like suggested on the first page)

Please let me know if there is an interest to address the above points, I
would be happy to look into it (free of charge of course!).

Thank you
Best regards
Morgan

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

Avraham Adler
Isn’t the default method now “radix” which is the data.table sort, and
isn’t that already parallel using openmp where available?

Avi

On Mon, Mar 15, 2021 at 12:26 PM Morgan Morgan <[hidden email]>
wrote:

> Hi,
> I am not sure if this is the right mailing list, so apologies in advance if
> it is not.
>
> I found the following link/presentation:
> https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf
>
> The implementation of fsort is interesting but incomplete (not sure why?)
> and can be improved or made faster (at least 25%  I believe). I might be
> wrong but there are maybe a couple of bugs as well.
>
> My questions are:
>
> 1/ Is the R Core team interested in a faster sorting algo? (Multithread or
> even single threaded)
>
> 2/ I see an issue with the license, which is MPL-2.0, and hence not
> compatible with base R, Python and Julia. Is there an interest to change
> the license of fsort so all 3 languages (and all the people using these
> languages) can benefit from it? (Like suggested on the first page)
>
> Please let me know if there is an interest to address the above points, I
> would be happy to look into it (free of charge of course!).
>
> Thank you
> Best regards
> Morgan
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Sent from Gmail Mobile

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

MorganMorgan
Default method for sort is not radix(especially for character vector). You
might want to read the documentation of sort.
For your second question, I invite you to look at the code of fsort. It is
implemented only for positive finite double, and default to
data.table:::forder ... when the types are different than positive double...
Please read the pdf link I sent, everything is explained in it.
Thank you
Morgan

On Mon, 15 Mar 2021, 16:52 Avraham Adler, <[hidden email]> wrote:

> Isn’t the default method now “radix” which is the data.table sort, and
> isn’t that already parallel using openmp where available?
>
> Avi
>
> On Mon, Mar 15, 2021 at 12:26 PM Morgan Morgan <[hidden email]>
> wrote:
>
>> Hi,
>> I am not sure if this is the right mailing list, so apologies in advance
>> if
>> it is not.
>>
>> I found the following link/presentation:
>> https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf
>>
>> The implementation of fsort is interesting but incomplete (not sure why?)
>> and can be improved or made faster (at least 25%  I believe). I might be
>> wrong but there are maybe a couple of bugs as well.
>>
>> My questions are:
>>
>> 1/ Is the R Core team interested in a faster sorting algo? (Multithread or
>> even single threaded)
>>
>> 2/ I see an issue with the license, which is MPL-2.0, and hence not
>> compatible with base R, Python and Julia. Is there an interest to change
>> the license of fsort so all 3 languages (and all the people using these
>> languages) can benefit from it? (Like suggested on the first page)
>>
>> Please let me know if there is an interest to address the above points, I
>> would be happy to look into it (free of charge of course!).
>>
>> Thank you
>> Best regards
>> Morgan
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> --
> Sent from Gmail Mobile
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

aBBy Spurdle, ⍺XY
In reply to this post by MorganMorgan
In principle, I agree that faster ranking/sorting algorithms are
important, and should be a priority.
But I can't help but feel that the paper focuses on textbook-oriented problems.

Given that in real world problems, there's almost always some form of
prior knowledge:
Wouldn't it be better, from a management perspective, to focus on
sorting algorithms, that incorporate that prior knowledge?

I'm not sure whether that's an R-devel discussion, or for another forum...


On Tue, Mar 16, 2021 at 5:25 AM Morgan Morgan <[hidden email]> wrote:

>
> Hi,
> I am not sure if this is the right mailing list, so apologies in advance if
> it is not.
>
> I found the following link/presentation:
> https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf
>
> The implementation of fsort is interesting but incomplete (not sure why?)
> and can be improved or made faster (at least 25%  I believe). I might be
> wrong but there are maybe a couple of bugs as well.
>
> My questions are:
>
> 1/ Is the R Core team interested in a faster sorting algo? (Multithread or
> even single threaded)
>
> 2/ I see an issue with the license, which is MPL-2.0, and hence not
> compatible with base R, Python and Julia. Is there an interest to change
> the license of fsort so all 3 languages (and all the people using these
> languages) can benefit from it? (Like suggested on the first page)
>
> Please let me know if there is an interest to address the above points, I
> would be happy to look into it (free of charge of course!).
>
> Thank you
> Best regards
> Morgan
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

Radford Neal
In reply to this post by MorganMorgan
Those interested in faster sorting may want to look at the merge sort
implemented in pqR (see pqR-project.org).  It's often used as the
default, because it is stable, and does different collations, while
being faster than shell sort (except for small vectors).

Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
compiled identically:

-----------------------------
pqR-2020-07-23 in C locale:

> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed
  1.332   0.000   1.334
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed
  0.092   0.004   0.096
> print(system.time (om <- order(x,method="merge")))
   user  system elapsed
  0.363   0.000   0.363
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 1 2

-----------------------------
R-4.0.2 in C locale:

> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed
  2.381   0.004   2.387
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed
  0.138   0.000   0.137
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))

------------------------------------
pqR-2020-07-23 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur     système      écoulé
      2.960       0.000       2.962
> print(system.time (or <- order(x,method="radix")))
utilisateur     système      écoulé
      0.083       0.008       0.092
> print(system.time (om <- order(x,method="merge")))
utilisateur     système      écoulé
      1.143       0.000       1.142
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 2 1

------------------------------------
R-4.0.2 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur     système      écoulé
      4.222       0.016       4.239
> print(system.time (or <- order(x,method="radix")))
utilisateur     système      écoulé
      0.136       0.000       0.137
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))


pqR is faster in all the tests, but more relevant to this discussion
is that the "merge" method is substantially faster than "shell" for
these character vectors with a million elements, while retaining the
stability and collation properties of "shell" (whereas "radix" only
does C collation).

It would probably not be too hard to take the merge sort code from pqR
and use it in R core's implementation.

The merge sort in pqR doesn't exploit parallelism at the moment, but
merge sort is potentially quite parallelizable (though I think the
storage allocation strategy I use would have to be modified).

Regards,

   Radford Neal

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

MorganMorgan
Thank you Neal. This is interesting. I will have a look at pqR.
Indeed radix only does C collation, I believe that is why it is not the
default choice for character ordering and sorting.
Not sure but I believe it can help address the following bugzilla item:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400

On the same topic of collation, there is an experimental sorting function
"psort" in package kit that might help address this issue.

> library(kit)
Attaching kit 0.0.7 (OPENMP enabled using 1 thread)
> x <- c("b","A","B","a","\xe4")
> Encoding(x) <- "latin1"
> identical(psort(x, c.locale=FALSE), sort(x))
[1] TRUE
> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))
[1] TRUE

Coming back to the topic of fsort, I have just finished the implementation
for double, integer, factor and logical.
The implementation takes into account NA, Inf.. values. Values can be
sorted in a decreasing order or increasing order.
Comparing benchmark with the current implementation in data.table, it is
currently over 30% faster.
There might bugs but I am sure performance can be further improved as I did
not really try hard.
If there is interest in both the implementation and cross community
sharing, please let know....

Best regards,
Morgan

On Wed, 17 Mar 2021, 00:37 Radford Neal, <[hidden email]> wrote:

> Those interested in faster sorting may want to look at the merge sort
> implemented in pqR (see pqR-project.org).  It's often used as the
> default, because it is stable, and does different collations, while
> being faster than shell sort (except for small vectors).
>
> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
> compiled identically:
>
> -----------------------------
> pqR-2020-07-23 in C locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
>    user  system elapsed
>   1.332   0.000   1.334
> > print(system.time (or <- order(x,method="radix")))
>    user  system elapsed
>   0.092   0.004   0.096
> > print(system.time (om <- order(x,method="merge")))
>    user  system elapsed
>   0.363   0.000   0.363
> > print(identical(os,or))
> [1] TRUE
> > print(identical(os,om))
> [1] TRUE
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 1 2
> > print(order(x,method="radix"))
> [1] 1 2
> > print(order(x,method="merge"))
> [1] 1 2
>
> -----------------------------
> R-4.0.2 in C locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
>    user  system elapsed
>   2.381   0.004   2.387
> > print(system.time (or <- order(x,method="radix")))
>    user  system elapsed
>   0.138   0.000   0.137
> > #print(system.time (om <- order(x,method="merge")))
> > print(identical(os,or))
> [1] TRUE
> > #print(identical(os,om))
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 1 2
> > print(order(x,method="radix"))
> [1] 1 2
> > #print(order(x,method="merge"))
>
> ------------------------------------
> pqR-2020-07-23 in fr_CA.utf8 locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
> utilisateur     système      écoulé
>       2.960       0.000       2.962
> > print(system.time (or <- order(x,method="radix")))
> utilisateur     système      écoulé
>       0.083       0.008       0.092
> > print(system.time (om <- order(x,method="merge")))
> utilisateur     système      écoulé
>       1.143       0.000       1.142
> > print(identical(os,or))
> [1] TRUE
> > print(identical(os,om))
> [1] TRUE
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 2 1
> > print(order(x,method="radix"))
> [1] 1 2
> > print(order(x,method="merge"))
> [1] 2 1
>
> ------------------------------------
> R-4.0.2 in fr_CA.utf8 locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
> utilisateur     système      écoulé
>       4.222       0.016       4.239
> > print(system.time (or <- order(x,method="radix")))
> utilisateur     système      écoulé
>       0.136       0.000       0.137
> > #print(system.time (om <- order(x,method="merge")))
> > print(identical(os,or))
> [1] TRUE
> > #print(identical(os,om))
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 2 1
> > print(order(x,method="radix"))
> [1] 1 2
> > #print(order(x,method="merge"))
>
>
> pqR is faster in all the tests, but more relevant to this discussion
> is that the "merge" method is substantially faster than "shell" for
> these character vectors with a million elements, while retaining the
> stability and collation properties of "shell" (whereas "radix" only
> does C collation).
>
> It would probably not be too hard to take the merge sort code from pqR
> and use it in R core's implementation.
>
> The merge sort in pqR doesn't exploit parallelism at the moment, but
> merge sort is potentially quite parallelizable (though I think the
> storage allocation strategy I use would have to be modified).
>
> Regards,
>
>    Radford Neal
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

frederik-2
I think it is "Professor Neal" :)

I also appreciate the pqR comparisons.

On Wed, Mar 17, 2021 at 09:23:15AM +0000, Morgan Morgan wrote:

>Thank you Neal. This is interesting. I will have a look at pqR.
>Indeed radix only does C collation, I believe that is why it is not the
>default choice for character ordering and sorting.
>Not sure but I believe it can help address the following bugzilla item:
>https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400
>
>On the same topic of collation, there is an experimental sorting function
>"psort" in package kit that might help address this issue.
>
>> library(kit)
>Attaching kit 0.0.7 (OPENMP enabled using 1 thread)
>> x <- c("b","A","B","a","\xe4")
>> Encoding(x) <- "latin1"
>> identical(psort(x, c.locale=FALSE), sort(x))
>[1] TRUE
>> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))
>[1] TRUE
>
>Coming back to the topic of fsort, I have just finished the implementation
>for double, integer, factor and logical.
>The implementation takes into account NA, Inf.. values. Values can be
>sorted in a decreasing order or increasing order.
>Comparing benchmark with the current implementation in data.table, it is
>currently over 30% faster.
>There might bugs but I am sure performance can be further improved as I did
>not really try hard.
>If there is interest in both the implementation and cross community
>sharing, please let know....
>
>Best regards,
>Morgan
>
>On Wed, 17 Mar 2021, 00:37 Radford Neal, <[hidden email]> wrote:
>
>> Those interested in faster sorting may want to look at the merge sort
>> implemented in pqR (see pqR-project.org).  It's often used as the
>> default, because it is stable, and does different collations, while
>> being faster than shell sort (except for small vectors).
>>
>> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
>> compiled identically:
>>
>> -----------------------------
>> pqR-2020-07-23 in C locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>>    user  system elapsed
>>   1.332   0.000   1.334
>> > print(system.time (or <- order(x,method="radix")))
>>    user  system elapsed
>>   0.092   0.004   0.096
>> > print(system.time (om <- order(x,method="merge")))
>>    user  system elapsed
>>   0.363   0.000   0.363
>> > print(identical(os,or))
>> [1] TRUE
>> > print(identical(os,om))
>> [1] TRUE
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 1 2
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > print(order(x,method="merge"))
>> [1] 1 2
>>
>> -----------------------------
>> R-4.0.2 in C locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>>    user  system elapsed
>>   2.381   0.004   2.387
>> > print(system.time (or <- order(x,method="radix")))
>>    user  system elapsed
>>   0.138   0.000   0.137
>> > #print(system.time (om <- order(x,method="merge")))
>> > print(identical(os,or))
>> [1] TRUE
>> > #print(identical(os,om))
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 1 2
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > #print(order(x,method="merge"))
>>
>> ------------------------------------
>> pqR-2020-07-23 in fr_CA.utf8 locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>> utilisateur     système      écoulé
>>       2.960       0.000       2.962
>> > print(system.time (or <- order(x,method="radix")))
>> utilisateur     système      écoulé
>>       0.083       0.008       0.092
>> > print(system.time (om <- order(x,method="merge")))
>> utilisateur     système      écoulé
>>       1.143       0.000       1.142
>> > print(identical(os,or))
>> [1] TRUE
>> > print(identical(os,om))
>> [1] TRUE
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 2 1
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > print(order(x,method="merge"))
>> [1] 2 1
>>
>> ------------------------------------
>> R-4.0.2 in fr_CA.utf8 locale:
>>
>> > set.seed(1)
>> > N <- 1000000
>> > x <- as.character (sample(N,N,replace=TRUE))
>> > print(system.time (os <- order(x,method="shell")))
>> utilisateur     système      écoulé
>>       4.222       0.016       4.239
>> > print(system.time (or <- order(x,method="radix")))
>> utilisateur     système      écoulé
>>       0.136       0.000       0.137
>> > #print(system.time (om <- order(x,method="merge")))
>> > print(identical(os,or))
>> [1] TRUE
>> > #print(identical(os,om))
>> >
>> > x <- c("a","~")
>> > print(order(x,method="shell"))
>> [1] 2 1
>> > print(order(x,method="radix"))
>> [1] 1 2
>> > #print(order(x,method="merge"))
>>
>>
>> pqR is faster in all the tests, but more relevant to this discussion
>> is that the "merge" method is substantially faster than "shell" for
>> these character vectors with a million elements, while retaining the
>> stability and collation properties of "shell" (whereas "radix" only
>> does C collation).
>>
>> It would probably not be too hard to take the merge sort code from pqR
>> and use it in R core's implementation.
>>
>> The merge sort in pqR doesn't exploit parallelism at the moment, but
>> merge sort is potentially quite parallelizable (though I think the
>> storage allocation strategy I use would have to be modified).
>>
>> Regards,
>>
>>    Radford Neal
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> [[alternative HTML version deleted]]
>
>______________________________________________
>[hidden email] mailing list
>https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Faster sorting algorithm...

MorganMorgan
My apologies to Professor Neal.
Thank you for correcting me.
Best regards
Morgan


On Mon, 22 Mar 2021, 05:05 , <[hidden email]> wrote:

> I think it is "Professor Neal" :)
>
> I also appreciate the pqR comparisons.
>
> On Wed, Mar 17, 2021 at 09:23:15AM +0000, Morgan Morgan wrote:
> >Thank you Neal. This is interesting. I will have a look at pqR.
> >Indeed radix only does C collation, I believe that is why it is not the
> >default choice for character ordering and sorting.
> >Not sure but I believe it can help address the following bugzilla item:
> >https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400
> >
> >On the same topic of collation, there is an experimental sorting function
> >"psort" in package kit that might help address this issue.
> >
> >> library(kit)
> >Attaching kit 0.0.7 (OPENMP enabled using 1 thread)
> >> x <- c("b","A","B","a","\xe4")
> >> Encoding(x) <- "latin1"
> >> identical(psort(x, c.locale=FALSE), sort(x))
> >[1] TRUE
> >> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))
> >[1] TRUE
> >
> >Coming back to the topic of fsort, I have just finished the implementation
> >for double, integer, factor and logical.
> >The implementation takes into account NA, Inf.. values. Values can be
> >sorted in a decreasing order or increasing order.
> >Comparing benchmark with the current implementation in data.table, it is
> >currently over 30% faster.
> >There might bugs but I am sure performance can be further improved as I
> did
> >not really try hard.
> >If there is interest in both the implementation and cross community
> >sharing, please let know....
> >
> >Best regards,
> >Morgan
> >
> >On Wed, 17 Mar 2021, 00:37 Radford Neal, <[hidden email]> wrote:
> >
> >> Those interested in faster sorting may want to look at the merge sort
> >> implemented in pqR (see pqR-project.org).  It's often used as the
> >> default, because it is stable, and does different collations, while
> >> being faster than shell sort (except for small vectors).
> >>
> >> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
> >> compiled identically:
> >>
> >> -----------------------------
> >> pqR-2020-07-23 in C locale:
> >>
> >> > set.seed(1)
> >> > N <- 1000000
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >>    user  system elapsed
> >>   1.332   0.000   1.334
> >> > print(system.time (or <- order(x,method="radix")))
> >>    user  system elapsed
> >>   0.092   0.004   0.096
> >> > print(system.time (om <- order(x,method="merge")))
> >>    user  system elapsed
> >>   0.363   0.000   0.363
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > print(identical(os,om))
> >> [1] TRUE
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 1 2
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > print(order(x,method="merge"))
> >> [1] 1 2
> >>
> >> -----------------------------
> >> R-4.0.2 in C locale:
> >>
> >> > set.seed(1)
> >> > N <- 1000000
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >>    user  system elapsed
> >>   2.381   0.004   2.387
> >> > print(system.time (or <- order(x,method="radix")))
> >>    user  system elapsed
> >>   0.138   0.000   0.137
> >> > #print(system.time (om <- order(x,method="merge")))
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > #print(identical(os,om))
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 1 2
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > #print(order(x,method="merge"))
> >>
> >> ------------------------------------
> >> pqR-2020-07-23 in fr_CA.utf8 locale:
> >>
> >> > set.seed(1)
> >> > N <- 1000000
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >> utilisateur     système      écoulé
> >>       2.960       0.000       2.962
> >> > print(system.time (or <- order(x,method="radix")))
> >> utilisateur     système      écoulé
> >>       0.083       0.008       0.092
> >> > print(system.time (om <- order(x,method="merge")))
> >> utilisateur     système      écoulé
> >>       1.143       0.000       1.142
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > print(identical(os,om))
> >> [1] TRUE
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 2 1
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > print(order(x,method="merge"))
> >> [1] 2 1
> >>
> >> ------------------------------------
> >> R-4.0.2 in fr_CA.utf8 locale:
> >>
> >> > set.seed(1)
> >> > N <- 1000000
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >> utilisateur     système      écoulé
> >>       4.222       0.016       4.239
> >> > print(system.time (or <- order(x,method="radix")))
> >> utilisateur     système      écoulé
> >>       0.136       0.000       0.137
> >> > #print(system.time (om <- order(x,method="merge")))
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > #print(identical(os,om))
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 2 1
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > #print(order(x,method="merge"))
> >>
> >>
> >> pqR is faster in all the tests, but more relevant to this discussion
> >> is that the "merge" method is substantially faster than "shell" for
> >> these character vectors with a million elements, while retaining the
> >> stability and collation properties of "shell" (whereas "radix" only
> >> does C collation).
> >>
> >> It would probably not be too hard to take the merge sort code from pqR
> >> and use it in R core's implementation.
> >>
> >> The merge sort in pqR doesn't exploit parallelism at the moment, but
> >> merge sort is potentially quite parallelizable (though I think the
> >> storage allocation strategy I use would have to be modified).
> >>
> >> Regards,
> >>
> >>    Radford Neal
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> >______________________________________________
> >[hidden email] mailing list
> >https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel