any other fast method for median calculation

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

any other fast method for median calculation

Zheng, Xin (NIH) [C]
Hi there,

I got a data frame with more than 200k columns. How could I get median of each column fast? mapply is the fastest function I know for that, it's not yet satisfied though.

It seems function "median" in R calculates median by "sort" and "mean". I am wondering if there is another function with better algorithm.

Any hint?

Thanks,

Xin Zheng
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: any other fast method for median calculation

slre
Sorting with an appropriate algorithm is nlog(n), so it's very hard to
get the 'exact' median any faster. However, if you can cope with a less
precise median, you could use a binary search between max(x) and min(x)
with low tolerance or comparatively few iterations. In native R, though,
that isn;t going to be fast; interpreter overhead will likely more than
wipe out any reduction in number of comparisons.

In any case, it looks like you are not constrained by the median
algorithm, but by the number of calls. You might do a lot better with
apply, though
> apply(df,2,median)

On my system 200k columns were processed in negligible time by apply
and I'm still waiting for mapply.

S



>>> "Zheng, Xin (NIH) [C]" <[hidden email]> 14/04/2009 05:29:40
>>>
Hi there,

I got a data frame with more than 200k columns. How could I get median
of each column fast? mapply is the fastest function I know for that,
it's not yet satisfied though.

It seems function "median" in R calculates median by "sort" and "mean".
I am wondering if there is another function with better algorithm.

Any hint?

Thanks,

Xin Zheng
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help 
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html 
and provide commented, minimal, self-contained, reproducible code.

*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: any other fast method for median calculation

D. Rizopoulos
S Ellison wrote:

> Sorting with an appropriate algorithm is nlog(n), so it's very hard to
> get the 'exact' median any faster. However, if you can cope with a less
> precise median, you could use a binary search between max(x) and min(x)
> with low tolerance or comparatively few iterations. In native R, though,
> that isn;t going to be fast; interpreter overhead will likely more than
> wipe out any reduction in number of comparisons.
>
> In any case, it looks like you are not constrained by the median
> algorithm, but by the number of calls. You might do a lot better with
> apply, though
>> apply(df,2,median)

well, for data frames, I think sapply(...) or even unlist(lapply(...))
will be faster, e.g.,

mat <- matrix(rnorm(50*2e05), 50, 2e05)
DF <- as.data.frame(mat)

invisible({gc(); gc()})
system.time(apply(DF, 2, median))

invisible({gc(); gc()})
system.time(sapply(DF, median))

invisible({gc(); gc()})
system.time(unlist(lapply(DF, median), use.names = FALSE))


Best,
Dimitris


> On my system 200k columns were processed in negligible time by apply
> and I'm still waiting for mapply.
>
> S
>
>
>
>>>> "Zheng, Xin (NIH) [C]" <[hidden email]> 14/04/2009 05:29:40
>>>>
> Hi there,
>
> I got a data frame with more than 200k columns. How could I get median
> of each column fast? mapply is the fastest function I know for that,
> it's not yet satisfied though.
>
> It seems function "median" in R calculates median by "sort" and "mean".
> I am wondering if there is another function with better algorithm.
>
> Any hint?
>
> Thanks,
>
> Xin Zheng
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help 
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html 
> and provide commented, minimal, self-contained, reproducible code.
>
> *******************************************************************
> This email and any attachments are confidential. Any use...{{dropped:8}}
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: any other fast method for median calculation

Prof. Dr. Matthias Kohl
there is function rowMedians in Bioconductor package Biobase which works
for numeric matrices and might help.
Matthias

Dimitris Rizopoulos wrote:

> S Ellison wrote:
>> Sorting with an appropriate algorithm is nlog(n), so it's very hard to
>> get the 'exact' median any faster. However, if you can cope with a less
>> precise median, you could use a binary search between max(x) and min(x)
>> with low tolerance or comparatively few iterations. In native R, though,
>> that isn;t going to be fast; interpreter overhead will likely more than
>> wipe out any reduction in number of comparisons.
>>
>> In any case, it looks like you are not constrained by the median
>> algorithm, but by the number of calls. You might do a lot better with
>> apply, though
>>> apply(df,2,median)
>
> well, for data frames, I think sapply(...) or even unlist(lapply(...))
> will be faster, e.g.,
>
> mat <- matrix(rnorm(50*2e05), 50, 2e05)
> DF <- as.data.frame(mat)
>
> invisible({gc(); gc()})
> system.time(apply(DF, 2, median))
>
> invisible({gc(); gc()})
> system.time(sapply(DF, median))
>
> invisible({gc(); gc()})
> system.time(unlist(lapply(DF, median), use.names = FALSE))
>
>
> Best,
> Dimitris
>
>
>> On my system 200k columns were processed in negligible time by apply
>> and I'm still waiting for mapply.
>>
>> S
>>
>>
>>
>>>>> "Zheng, Xin (NIH) [C]" <[hidden email]> 14/04/2009 05:29:40
>>>>>
>> Hi there,
>>
>> I got a data frame with more than 200k columns. How could I get median
>> of each column fast? mapply is the fastest function I know for that,
>> it's not yet satisfied though.
>> It seems function "median" in R calculates median by "sort" and "mean".
>> I am wondering if there is another function with better algorithm.
>>
>> Any hint?
>>
>> Thanks,
>>
>> Xin Zheng
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the
>> posting guide
>> http://www.R-project.org/posting-guide.html and provide commented,
>> minimal, self-contained, reproducible code.
>>
>> *******************************************************************
>> This email and any attachments are confidential. Any use...{{dropped:8}}
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

--
Dr. Matthias Kohl
www.stamats.de

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: any other fast method for median calculation

RKoenker
In reply to this post by Zheng, Xin (NIH) [C]
There is a slightly faster algorithm in my quantreg package, see  
kuantile()
but this is only significant when sample sizes are very large.  In  
your case
you really need a wrapper that keeps the loop over columns within some
lower level language.

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    [hidden email]            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820



On Apr 13, 2009, at 11:29 PM, Zheng, Xin (NIH) [C] wrote:

> Hi there,
>
> I got a data frame with more than 200k columns. How could I get  
> median of each column fast? mapply is the fastest function I know  
> for that, it's not yet satisfied though.
>
> It seems function "median" in R calculates median by "sort" and  
> "mean". I am wondering if there is another function with better  
> algorithm.
>
> Any hint?
>
> Thanks,
>
> Xin Zheng
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: any other fast method for median calculation

Thomas Lumley
In reply to this post by slre
On Tue, 14 Apr 2009, S Ellison wrote:

> Sorting with an appropriate algorithm is nlog(n), so it's very hard to
> get the 'exact' median any faster.

There actually are linear-time algorithms for the median, but n has to be very large before they are worth using, and by then you have to start considering locality of reference and other issues.

> In any case, it looks like you are not constrained by the median
> algorithm, but by the number of calls. You might do a lot better with
> apply, though
>> apply(df,2,median)
>
> On my system 200k columns were processed in negligible time by apply
> and I'm still waiting for mapply.

I'd also note that this is the sort of problem where the profiler is useful: you can see on a smaller subset whether R is spending most of its time in median() or somewhere else.

I wouldn't be surprised if a while() loop was even faster than apply() in this setting, but probably not enough to care about.

       -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.