# any other fast method for median calculation

6 messages

## any other fast method for median calculation

Hi there,

I have a data frame with more than 200k columns. How can I get the median of each column quickly? mapply is the fastest function I know of for this, but it is still not fast enough.

It seems that the function "median" in R calculates the median using "sort" and "mean". I am wondering whether there is another function with a better algorithm.

Any hints?

Thanks,

Xin Zheng

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

## Re: any other fast method for median calculation

Sorting with an appropriate algorithm is O(n log n), so it's very hard to get the 'exact' median any faster. However, if you can cope with a less precise median, you could use a binary search between min(x) and max(x) with a coarse tolerance or comparatively few iterations. In native R, though, that isn't going to be fast; interpreter overhead will likely more than wipe out any reduction in the number of comparisons.

In any case, it looks like you are constrained not by the median algorithm but by the number of calls. You might do a lot better with apply, though:

    apply(df, 2, median)

On my system 200k columns were processed in negligible time by apply, and I'm still waiting for mapply.

S

>>> "Zheng, Xin (NIH) [C]" <[hidden email]> 14/04/2009 05:29:40 >>>

> Hi there,
>
> I got a data frame with more than 200k columns. How could I get median of each column fast? mapply is the fastest function I know for that, it's not yet satisfied though.
>
> It seems function "median" in R calculates median by "sort" and "mean". I am wondering if there is another function with better algorithm.
>
> Any hint?
>
> Thanks,
>
> Xin Zheng
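The binary search mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `approx_median`, `tol` and `max_iter` are hypothetical names that do not come from the thread, and for an even number of values the search settles on the lower of the two middle values rather than their mean, so it can differ from `median()` by more than the tolerance in that case.

```r
# Hypothetical helper (not from the thread): approximate median by
# bisection on the value range between min(x) and max(x).
approx_median <- function(x, tol = 1e-3, max_iter = 50L) {
  x <- x[!is.na(x)]
  n <- length(x)
  stopifnot(n > 0L)
  lo <- min(x)
  hi <- max(x)
  for (i in seq_len(max_iter)) {
    if (hi - lo <= tol) break
    mid <- (lo + hi) / 2
    # If at least half of the values are <= mid, the (lower) median
    # lies in [lo, mid]; otherwise it lies in (mid, hi].
    if (sum(x <= mid) >= n / 2) hi <- mid else lo <- mid
  }
  (lo + hi) / 2
}

# Assumed usage on a small data frame, one call per column.
set.seed(1)
DF <- as.data.frame(matrix(rnorm(51 * 20), 51, 20))
approx <- vapply(DF, approx_median, numeric(1), tol = 1e-6)
exact  <- vapply(DF, median, numeric(1))
```

Each iteration is a full pass over the column, so in plain R this is unlikely to beat `median()`, which is the point made above about interpreter overhead.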

## Re: any other fast method for median calculation

S Ellison wrote:
> In any case, it looks like you are not constrained by the median
> algorithm, but by the number of calls. You might do a lot better with
> apply, though
>
>     apply(df,2,median)

Well, for data frames, I think sapply(...) or even unlist(lapply(...)) will be faster, e.g.,

    mat <- matrix(rnorm(50 * 2e05), 50, 2e05)
    DF <- as.data.frame(mat)

    invisible({gc(); gc()})
    system.time(apply(DF, 2, median))

    invisible({gc(); gc()})
    system.time(sapply(DF, median))

    invisible({gc(); gc()})
    system.time(unlist(lapply(DF, median), use.names = FALSE))

Best,
Dimitris

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
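Before comparing timings it may be worth confirming that the three calls return the same numbers. The sketch below does that on a much smaller data frame (200 columns, an arbitrary size chosen here so it runs instantly). The `vapply` line is an addition that does not appear in the thread; it is included only as a type-checked variant of `sapply`.

```r
set.seed(1)
DF <- as.data.frame(matrix(rnorm(50 * 200), 50, 200))

# apply() first coerces the data frame to a matrix via as.matrix().
m_apply <- apply(DF, 2, median)

# sapply()/lapply() iterate over the columns of the data frame directly.
m_sapply <- sapply(DF, median)
m_lapply <- unlist(lapply(DF, median), use.names = FALSE)

# Not from the thread: vapply() additionally checks that every result
# is a single numeric value.
m_vapply <- vapply(DF, median, numeric(1))

# All four should agree once names are dropped.
same <- c(
  isTRUE(all.equal(unname(m_apply),  m_lapply)),
  isTRUE(all.equal(unname(m_sapply), m_lapply)),
  isTRUE(all.equal(unname(m_vapply), m_lapply))
)
```

If every element of `same` is `TRUE`, any difference seen with `system.time()` comes from how the columns are traversed rather than from the median computation itself.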

## Re: any other fast method for median calculation

There is a function rowMedians in the Bioconductor package Biobase which works for numeric matrices and might help.

Matthias

--
Dr. Matthias Kohl
www.stamats.de
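Since rowMedians works along rows, one way to apply it to the columns of a data frame is to convert to a matrix and transpose first. The following is a minimal sketch that assumes all columns are numeric; the base R fallback is not part of the suggestion above and is there only so the example still runs when Biobase is not installed.

```r
set.seed(1)
DF <- as.data.frame(matrix(rnorm(50 * 200), 50, 200))

# Columns of DF become rows of the transposed matrix.
mat_t <- t(as.matrix(DF))

if (requireNamespace("Biobase", quietly = TRUE)) {
  col_medians <- unname(Biobase::rowMedians(mat_t))
} else {
  # Fallback (assumption, base R only) when Biobase is unavailable.
  col_medians <- unname(apply(mat_t, 1, median))
}

# Reference values computed column by column.
reference <- unlist(lapply(DF, median), use.names = FALSE)
```

The conversion and transpose copy the data, so for a very wide data frame it may be preferable to keep the data as a matrix from the start.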