Quicker quantiles?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Quicker quantiles?

Roger Koenker-2
Motivated by Deepayan's recent inquiries about the efficiency of the  
R 'quantile'
function:

         http://tolstoy.newcastle.edu.au/R/devel/05/11/3305.html
         http://tolstoy.newcastle.edu.au/R/devel/06/03/4358.html

I decided to try to revive an old project to implement a version of  
the Floyd
and Rivest (1975) algorithm for finding quantiles with O(n)  
comparisons.  I
used to have a direct translation from FR's algol68 in my quantreg  
package,
but was forced to remove it when it encountered  g77 compilers that
didn't like the way I had handled recursion.

Fortunately, in the interim, I've corresponded with Krzysztof Kiwiel  
in Warsaw
who has made a detailed study of the algorithm:

K.C. Kiwiel: On Floyd and Rivest's SELECT Algorithm, Theoretical
       Computer Sci. 347 (2005) 214-238.

He has also kindly  agreed to allow me to incorporate his  
implementation of
the FR algorithm in R under GPL.

For the moment, I have made an alpha-test package with a single  
function --
'kuantile' -- that attempts to reproduce the functionality of the  
current
base quantile function, essentially replacing the partial sorting done
there with calls to Kiwiel's 'select' function.  This package is  
available
from:

         http://www.econ.uiuc.edu/~roger/research/rq/kuantile

The good news is that the new function  _is_ somewhat quicker than
the base 'quantile' function.  The following table is based on means
of 10 replications.  The first two columns report times for computing
just the median, the next two columns for computing the five default
quantiles:  seq(0,1,.25), and the last column is the time needed for  
a call
of sort().

         mean system.time()[1] for various calls*

           median only      default 5
n      quantile kuantile quantile kuantile      sort

100        0.002    0.001    0.004    0.004       0.000
1000      0.002    0.001    0.003    0.002       0.002
10000    0.003    0.002    0.009    0.005       0.005
1e+05    0.024    0.010    0.061    0.013       0.031
1e+06    0.206    0.120    0.535    0.142       0.502
1e+07    2.132    0.790    5.575    1.035       7.484

* On a mac G5 running R 2.2.1.

The bad news is that this approach doesn't help with Deepayan's
original question which involved instances in which quantile() was
being called for rather large number of probabilities.  In such
cases it seems that it is difficult to improve upon doing a full sort.

I would welcome comments on any of this....

url:    www.econ.uiuc.edu/~roger    Roger Koenker
email:  [hidden email]           Department of Economics
vox:    217-333-4558                University of Illinois
fax:    217-244-6678                Champaign, IL 61820

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Quicker quantiles?

Brian Ripley
Please do the comparison with R-devel.  That has a different policy,
including switching to quicksort when partial has more than 10 elements.

BTW you need to compare with quicksort (which suffices here and is quite a
lot faster than shellsort for large numeric vectors).

For 1 million+1 I am getting timings on my Windows laptop

sort 0.39
quicksort 0.24
quantile(50%)   0.13
quantile() 0.17
kuantile(50%)   0.16
kuantile()      0.17

(timings for a million are about 50% more in the last four lines as
roughly twice as many raw quantiles are needed.)

Note that there is never more than about a factor of three gain over
quicksort (remembering that quantile has some overhead of its own, so I
also timed a version of quantile always using quicksort, using about
0.34).  Given how rare this is as a task for R, it would probably be quite
sufficient to always use quicksort.


On Sat, 11 Mar 2006, roger koenker wrote:

> Motivated by Deepayan's recent inquiries about the efficiency of the
> R 'quantile'
> function:
>
>         http://tolstoy.newcastle.edu.au/R/devel/05/11/3305.html
>         http://tolstoy.newcastle.edu.au/R/devel/06/03/4358.html
>
> I decided to try to revive an old project to implement a version of
> the Floyd
> and Rivest (1975) algorithm for finding quantiles with O(n)
> comparisons.  I
> used to have a direct translation from FR's algol68 in my quantreg
> package,
> but was forced to remove it when it encountered  g77 compilers that
> didn't like the way I had handled recursion.
>
> Fortunately, in the interim, I've corresponded with Krzysztof Kiwiel
> in Warsaw
> who has made a detailed study of the algorithm:
>
> K.C. Kiwiel: On Floyd and Rivest's SELECT Algorithm, Theoretical
>       Computer Sci. 347 (2005) 214-238.
>
> He has also kindly  agreed to allow me to incorporate his
> implementation of
> the FR algorithm in R under GPL.
>
> For the moment, I have made an alpha-test package with a single
> function --
> 'kuantile' -- that attempts to reproduce the functionality of the
> current
> base quantile function, essentially replacing the partial sorting done
> there with calls to Kiwiel's 'select' function.  This package is
> available
> from:
>
>         http://www.econ.uiuc.edu/~roger/research/rq/kuantile
>
> The good news is that the new function  _is_ somewhat quicker than
> the base 'quantile' function.  The following table is based on means
> of 10 replications.  The first two columns report times for computing
> just the median, the next two columns for computing the five default
> quantiles:  seq(0,1,.25), and the last column is the time needed for
> a call
> of sort().
>
>         mean system.time()[1] for various calls*
>
>           median only      default 5
> n      quantile kuantile quantile kuantile      sort
>
> 100        0.002    0.001    0.004    0.004       0.000
> 1000      0.002    0.001    0.003    0.002       0.002
> 10000    0.003    0.002    0.009    0.005       0.005
> 1e+05    0.024    0.010    0.061    0.013       0.031
> 1e+06    0.206    0.120    0.535    0.142       0.502
> 1e+07    2.132    0.790    5.575    1.035       7.484
>
> * On a mac G5 running R 2.2.1.
>
> The bad news is that this approach doesn't help with Deepayan's
> original question which involved instances in which quantile() was
> being called for rather large number of probabilities.  In such
> cases it seems that it is difficult to improve upon doing a full sort.
>
> I would welcome comments on any of this....
>
> url:    www.econ.uiuc.edu/~roger    Roger Koenker
> email:  [hidden email]           Department of Economics
> vox:    217-333-4558                University of Illinois
> fax:    217-244-6678                Champaign, IL 61820
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel