# Suggestions to speed up median() and has.na()

8 messages
Open this post in threaded view
|

## Suggestions to speed up median() and has.na()

 Hi, I've got two suggestions how to speed up median() about 50%.  For all iterative methods calling median() in the loops this has a major impact.  The second suggestion will apply to other methods too. This is what the functions look like today: > median function (x, na.rm = FALSE) {     if (is.factor(x) || mode(x) != "numeric")         stop("need numeric data")     if (na.rm)         x <- x[!is.na(x)]     else if (any(is.na(x)))         return(NA)     n <- length(x)     if (n == 0)         return(NA)     half <- (n + 1)/2     if (n%%2 == 1) {         sort(x, partial = half)[half]     }     else {         sum(sort(x, partial = c(half, half + 1))[c(half, half +             1)])/2     } } Suggestion 1: Replace the sort() calls with the .Internal(psort(x, partial)).   This will avoid unnecessary overhead, especially an expensive second check for NAs using any(is.na(x)).  Simple benchmarking with x <- rnorm(10e6) system.time(median(x))/system.time(median2(x)) where median2() is the function with the above replacements, gives about 20-25% speed up. Suggestion 2: Create a has.na(x) function to replace any(is.na(x)) that returns TRUE as soon as a NA value is detected.  In the best case it returns after the first index with TRUE, in the worst case it returns after the last index N with FALSE.  The cost for is.na(x) is always O(N), and any() in the best case O(1) and in the worst case O(N) (if any() is implemented as I hope).  An has.na() function would be very useful elsewhere too. An poor mans alternative to (2), is to have a third alternative to 'na.rm', say, NA, which indicates that we know that there are no NAs in 'x'. The original median() is approx 50% slower (naive benchmarking) than a version with the above two improvements, if passing a large 'x' with no NAs; median2 <- function (x, na.rm = FALSE) {     if (is.factor(x) || mode(x) != "numeric")         stop("need numeric data")     if (is.na(na.rm)) {     } else if (na.rm)         x <- x[!is.na(x)]     else if (any(is.na(x)))         return(NA)     n <- length(x)     if (n == 0)         return(NA)     half <- (n + 1)/2     if (n%%2 == 1) {         .Internal(psort(x, half))[half]     }     else {         sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2     } } x <- rnorm(10e5) K <- 10 t0 <- system.time({   for (kk in 1:K)     y <- median(x); }) print(t0)  # [1] 1.82 0.14 1.98   NA   NA t1 <- system.time({   for (kk in 1:K)     y <- median2(x, na.rm=NA); }) print(t1)  # [1] 1.25 0.06 1.34   NA   NA print(t0/t1)  # [1] 1.456000 2.333333 1.477612       NA       NA BTW, without having checked the source code, it looks like is.na() is unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on a vector without NAs.  On the other hand, is.na(sum(x)) becomes awfully slow if 'x' contains NAs. /Henrik ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Open this post in threaded view
|

## Re: Suggestions to speed up median() and has.na()

 On Mon, 10 Apr 2006, Henrik Bengtsson wrote: > Hi, > > I've got two suggestions how to speed up median() about 50%.  For all > iterative methods calling median() in the loops this has a major > impact.  The second suggestion will apply to other methods too. I'm surprised this has a major impact -- in your example it takes much longer to generate the ten million numbers than to find the median. > Suggestion 1: > Replace the sort() calls with the .Internal(psort(x, partial)).   This > will avoid unnecessary overhead, especially an expensive second check > for NAs using any(is.na(x)).  Simple benchmarking with > > x <- rnorm(10e6) > system.time(median(x))/system.time(median2(x)) > > where median2() is the function with the above replacements, gives > about 20-25% speed up. There's something that seems a bit undesirable about having median() call the .Internal function for sort(). > Suggestion 2: > Create a has.na(x) function to replace any(is.na(x)) that returns TRUE > as soon as a NA value is detected.  In the best case it returns after > the first index with TRUE, in the worst case it returns after the last > index N with FALSE.  The cost for is.na(x) is always O(N), and any() > in the best case O(1) and in the worst case O(N) (if any() is > implemented as I hope).  An has.na() function would be very useful > elsewhere too. This sounds useful (though it has missed the deadline for 2.3.0). It won't help if the typical case is no missing values, as you suggest, but it will be faster when there are missing values. > BTW, without having checked the source code, it looks like is.na() is > unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on > a vector without NAs.  On the other hand, is.na(sum(x)) becomes > awfully slow if 'x' contains NAs. > I don't think  it is unnecessarily slow.  It has to dispatch methods and it has to make sure that matrix structure is preserved.  After that the code is just      case REALSXP:          for (i = 0; i < n; i++)              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);          break; and it's hard to see how that can be improved. It does suggest that a faster anyNA() function would have to not be generic.   -thomas Thomas Lumley Assoc. Professor, Biostatistics [hidden email] University of Washington, Seattle ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Open this post in threaded view
|

## Re: Suggestions to speed up median() and has.na()

 On Mon, 10 Apr 2006, Thomas Lumley wrote: > On Mon, 10 Apr 2006, Henrik Bengtsson wrote: > > > Hi, > > > > I've got two suggestions how to speed up median() about 50%.  For all > > iterative methods calling median() in the loops this has a major > > impact.  The second suggestion will apply to other methods too. > > > Suggestion 2: > > Create a has.na(x) function to replace any(is.na(x)) that returns TRUE > > as soon as a NA value is detected.  In the best case it returns after > > the first index with TRUE, in the worst case it returns after the last > > index N with FALSE.  The cost for is.na(x) is always O(N), and any() > > in the best case O(1) and in the worst case O(N) (if any() is > > implemented as I hope).  An has.na() function would be very useful > > elsewhere too. > > This sounds useful (though it has missed the deadline for 2.3.0). > > It won't help if the typical case is no missing values, as you suggest, > but it will be faster when there are missing values. Splus has such a function, but it is called anyMissing().  In the interests of interoperability it would be nice if R used that name. (I did not choose the name, but that is what it is.) The following experiment using Splus seems to indicate the speedup has less to do with stopping at the first NA than it does with not making/filling/copying/whatever the big vector of logicals that is.na returns.    > # NA near start of list of 10 million integers    > { z<-replace(1:1e7,2,NA); unix.time(anyMissing(z)) }    [1] 0 0 0 0 0    > { z<-replace(1:1e7,2,NA); unix.time(any(is.na(z)))}    [1] 0.62 0.13 0.75 0.00 0.00    > # NA at end of list    > { z<-replace(1:1e7,1e7,NA); unix.time(anyMissing(z)) }    [1] 0.07 0.00 0.07 0.00 0.00    > { z<-replace(1:1e7,1e7,NA); unix.time(any(is.na(z)))}    [1] 0.64 0.11 0.75 0.00 0.00 The Splus anyMissing is an s3 generic (i.e., it calls UseMethod()). The Splus is.na is an s4 generic and its default method may invoke an s3 generic. > > BTW, without having checked the source code, it looks like is.na() is > > unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on > > a vector without NAs.  On the other hand, is.na(sum(x)) becomes > > awfully slow if 'x' contains NAs. > > > > I don't think  it is unnecessarily slow.  It has to dispatch methods and > it has to make sure that matrix structure is preserved.  After that the > code is just > >      case REALSXP: >          for (i = 0; i < n; i++) >              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); >          break; > > and it's hard to see how that can be improved. It does suggest that a > faster anyNA() function would have to not be generic. ---------------------------------------------------------------------------- Bill Dunlap Insightful Corporation bill at insightful dot com 360-428-8146  "All statements in this message represent the opinions of the author and do  not necessarily reflect Insightful Corporation policy or position." ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Open this post in threaded view
|

## Re: Suggestions to speed up median() and has.na()

 In reply to this post by Thomas Lumley On 4/10/2006 7:22 PM, Thomas Lumley wrote: > On Mon, 10 Apr 2006, Henrik Bengtsson wrote: > >> Hi, >> >> I've got two suggestions how to speed up median() about 50%.  For all >> iterative methods calling median() in the loops this has a major >> impact.  The second suggestion will apply to other methods too. > > I'm surprised this has a major impact -- in your example it takes much > longer to generate the ten million numbers than to find the median. > >> Suggestion 1: >> Replace the sort() calls with the .Internal(psort(x, partial)).   This >> will avoid unnecessary overhead, especially an expensive second check >> for NAs using any(is.na(x)).  Simple benchmarking with >> >> x <- rnorm(10e6) >> system.time(median(x))/system.time(median2(x)) >> >> where median2() is the function with the above replacements, gives >> about 20-25% speed up. > > There's something that seems a bit undesirable about having median() call > the .Internal function for sort(). > >> Suggestion 2: >> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE >> as soon as a NA value is detected.  In the best case it returns after >> the first index with TRUE, in the worst case it returns after the last >> index N with FALSE.  The cost for is.na(x) is always O(N), and any() >> in the best case O(1) and in the worst case O(N) (if any() is >> implemented as I hope).  An has.na() function would be very useful >> elsewhere too. > > This sounds useful (though it has missed the deadline for 2.3.0). > > It won't help if the typical case is no missing values, as you suggest, > but it will be faster when there are missing values. I think it would help even in that case if the vector is large, because it avoids allocating and disposing of the logical vector of the same length as x. >> BTW, without having checked the source code, it looks like is.na() is >> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on >> a vector without NAs.  On the other hand, is.na(sum(x)) becomes >> awfully slow if 'x' contains NAs. >> > > I don't think  it is unnecessarily slow.  It has to dispatch methods and > it has to make sure that matrix structure is preserved.  After that the > code is just > >      case REALSXP: >          for (i = 0; i < n; i++) >              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); >          break; > > and it's hard to see how that can be improved. It does suggest that a > faster anyNA() function would have to not be generic. If it's necessary to make it not generic to achieve the speedup, I don't think it's worth doing.  If anyNA is written not to be generic I'd guess a very common error will be to apply it to a dataframe and get a misleading "FALSE" answer.  If we do that, I predict that the total amount of r-help time wasted on it will exceed the CPU time saved by orders of magnitude. Duncan Murdoch ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Open this post in threaded view
|

## Re: Suggestions to speed up median() and has.na()

 On Mon, 10 Apr 2006, Duncan Murdoch wrote: > On 4/10/2006 7:22 PM, Thomas Lumley wrote: >> On Mon, 10 Apr 2006, Henrik Bengtsson wrote: >> >>> Suggestion 2: >>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE >>> as soon as a NA value is detected.  In the best case it returns after >>> the first index with TRUE, in the worst case it returns after the last >>> index N with FALSE.  The cost for is.na(x) is always O(N), and any() >>> in the best case O(1) and in the worst case O(N) (if any() is >>> implemented as I hope).  An has.na() function would be very useful >>> elsewhere too. >> >> This sounds useful (though it has missed the deadline for 2.3.0). >> >> It won't help if the typical case is no missing values, as you suggest, but >> it will be faster when there are missing values. > > I think it would help even in that case if the vector is large, because it > avoids allocating and disposing of the logical vector of the same length as > x. That makes sense. I have just tried, and for vectors of length ten million it does make a measurable difference. >>> BTW, without having checked the source code, it looks like is.na() is >>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on >>> a vector without NAs.  On the other hand, is.na(sum(x)) becomes >>> awfully slow if 'x' contains NAs. >>> >> >> I don't think  it is unnecessarily slow.  It has to dispatch methods and it >> has to make sure that matrix structure is preserved.  After that the code >> is just >> >>      case REALSXP: >>          for (i = 0; i < n; i++) >>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); >>          break; >> >> and it's hard to see how that can be improved. It does suggest that a >> faster anyNA() function would have to not be generic. > > If it's necessary to make it not generic to achieve the speedup, I don't > think it's worth doing.  If anyNA is written not to be generic I'd guess a > very common error will be to apply it to a dataframe and get a misleading > "FALSE" answer.  If we do that, I predict that the total amount of r-help > time wasted on it will exceed the CPU time saved by orders of magnitude. > I wasn't proposing that it should be stupid, just not generic.  It could support data frames (sum(), does, for example). If it didn't support data frames it should certainly give an error rather than the wrong answer, but if we are seriously trying to avoid delays around 0.1 seconds then going through the generic function mechanism may be a problem.   -thomas ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Open this post in threaded view
|

## Re: Suggestions to speed up median() and has.na()

 On 4/10/2006 8:08 PM, Thomas Lumley wrote: > On Mon, 10 Apr 2006, Duncan Murdoch wrote: > >> On 4/10/2006 7:22 PM, Thomas Lumley wrote: >>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote: >>> >>>> Suggestion 2: >>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE >>>> as soon as a NA value is detected.  In the best case it returns after >>>> the first index with TRUE, in the worst case it returns after the last >>>> index N with FALSE.  The cost for is.na(x) is always O(N), and any() >>>> in the best case O(1) and in the worst case O(N) (if any() is >>>> implemented as I hope).  An has.na() function would be very useful >>>> elsewhere too. >>> This sounds useful (though it has missed the deadline for 2.3.0). >>> >>> It won't help if the typical case is no missing values, as you suggest, but >>> it will be faster when there are missing values. >> I think it would help even in that case if the vector is large, because it >> avoids allocating and disposing of the logical vector of the same length as >> x. > > That makes sense. I have just tried, and for vectors of length ten > million it does make a measurable difference. > > >>>> BTW, without having checked the source code, it looks like is.na() is >>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on >>>> a vector without NAs.  On the other hand, is.na(sum(x)) becomes >>>> awfully slow if 'x' contains NAs. >>>> >>> I don't think  it is unnecessarily slow.  It has to dispatch methods and it >>> has to make sure that matrix structure is preserved.  After that the code >>> is just >>> >>>      case REALSXP: >>>          for (i = 0; i < n; i++) >>>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); >>>          break; >>> >>> and it's hard to see how that can be improved. It does suggest that a >>> faster anyNA() function would have to not be generic. >> If it's necessary to make it not generic to achieve the speedup, I don't >> think it's worth doing.  If anyNA is written not to be generic I'd guess a >> very common error will be to apply it to a dataframe and get a misleading >> "FALSE" answer.  If we do that, I predict that the total amount of r-help >> time wasted on it will exceed the CPU time saved by orders of magnitude. >> > > I wasn't proposing that it should be stupid, just not generic.  It could > support data frames (sum(), does, for example). If it didn't support data > frames it should certainly give an error rather than the wrong answer, but > if we are seriously trying to avoid delays around 0.1 seconds then going > through the generic function mechanism may be a problem. If it's not dataframes, it will be something else.  I think it's highly desirable that any(is.na(x)) == anyNA(x) within base packages, and we should make it straightforward to maintain this identity in contributed packages. By the way, I think Bill's suggestion of calling it anyMissing makes a lot of sense. Duncan ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel