# Need very fast application of 'diff' - ideas?

11 messages
Open this post in threaded view
|

## Need very fast application of 'diff' - ideas?

 Hi everyone, Speed is the key here. I need to find the difference between a vector and its one-period lag (i.e. the difference between each value and the subsequent one in the vector). Let's say the vector contains 10 million random integers between 0 and 1,000. The solution vector will have 9,999,999 values, since their is no lag for the 1st observation. In R we have: #Set up input vector x = runif(n=10e6, min=0, max=1000) x = round(x) #Find one-period difference y = diff(x) Question is: How can I get the 'diff(x)' part as fast as absolutely possible? I queried some colleagues who work with other languages, and they provided equivalent solutions in Python and Clojure that, on their machines, appear to be potentially much faster (I've put the code below in case anyone is interested). However, they mentioned that the overhead in passing the data between languages could kill any improvements. I don't have much experience integrating other languages, so I'm hoping the community has some ideas about how to approach this particular problem... Many thanks, Kevin In iPython: In [3]: import numpy as np In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16") In [5]: arr1 = arr[1:].view() In [6]: timeit arr2 = arr1 - arr[:-1] 10 loops, best of 3: 20.1 ms per loop In Clojure: (defn subtract-lag   [n]   (let [v (take n (repeatedly rand))]     (time (dorun (map - v (cons 0 v))))))         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

 I'd write your own diff() that eliminates the method dispatch and argument checking that diff -> diff.default does. x[-1] - x[-len(x)] # is all you really need. (# you could also try something like c(x[-1], NA) - x which may be marginally faster as it only subsets x once but you should profile to find out) is probably about as fast as you can get within pure R code (the function overhead will add a little bit of time as well, so if speed is truly the only thing that matters, best not to use it. If you wanna go for even more speed, you'll have to go to compiled code; I'd suggest inline+Rcpp as the easiest way to do so. That could get it down to a single pass through the vector in pure C (or nice C++) which seems to be a lower bound for speed. Michael On Fri, Jan 27, 2012 at 7:15 PM, Kevin Ummel <[hidden email]> wrote: > Hi everyone, > > Speed is the key here. > > I need to find the difference between a vector and its one-period lag (i.e. the difference between each value and the subsequent one in the vector). Let's say the vector contains 10 million random integers between 0 and 1,000. The solution vector will have 9,999,999 values, since their is no lag for the 1st observation. > > In R we have: > > #Set up input vector > x = runif(n=10e6, min=0, max=1000) > x = round(x) > > #Find one-period difference > y = diff(x) > > Question is: How can I get the 'diff(x)' part as fast as absolutely possible? I queried some colleagues who work with other languages, and they provided equivalent solutions in Python and Clojure that, on their machines, appear to be potentially much faster (I've put the code below in case anyone is interested). However, they mentioned that the overhead in passing the data between languages could kill any improvements. I don't have much experience integrating other languages, so I'm hoping the community has some ideas about how to approach this particular problem... > > Many thanks, > Kevin > > In iPython: > > In [3]: import numpy as np > In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16") > In [5]: arr1 = arr[1:].view() > In [6]: timeit arr2 = arr1 - arr[:-1] > 10 loops, best of 3: 20.1 ms per loop > > In Clojure: > > (defn subtract-lag >  [n] >  (let [v (take n (repeatedly rand))] >    (time (dorun (map - v (cons 0 v)))))) > > > > > >        [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

 In reply to this post by Kevin Ummel ehm... this doesn't take very many ideas. x = runif(n=10e6, min=0, max=1000) x = round(x) system.time( {   y = x[-1] - x[-length(x)] }) I get about 0.5 seconds on my old laptop. HTH Peter On Fri, Jan 27, 2012 at 4:15 PM, Kevin Ummel <[hidden email]> wrote: > Hi everyone, > > Speed is the key here. > > I need to find the difference between a vector and its one-period lag (i.e. the difference between each value and the subsequent one in the vector). Let's say the vector contains 10 million random integers between 0 and 1,000. The solution vector will have 9,999,999 values, since their is no lag for the 1st observation. > > In R we have: > > #Set up input vector > x = runif(n=10e6, min=0, max=1000) > x = round(x) > > #Find one-period difference > y = diff(x) > > Question is: How can I get the 'diff(x)' part as fast as absolutely possible? I queried some colleagues who work with other languages, and they provided equivalent solutions in Python and Clojure that, on their machines, appear to be potentially much faster (I've put the code below in case anyone is interested). However, they mentioned that the overhead in passing the data between languages could kill any improvements. I don't have much experience integrating other languages, so I'm hoping the community has some ideas about how to approach this particular problem... > > Many thanks, > Kevin > > In iPython: > > In [3]: import numpy as np > In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16") > In [5]: arr1 = arr[1:].view() > In [6]: timeit arr2 = arr1 - arr[:-1] > 10 loops, best of 3: 20.1 ms per loop > > In Clojure: > > (defn subtract-lag >  [n] >  (let [v (take n (repeatedly rand))] >    (time (dorun (map - v (cons 0 v)))))) > > > > > >        [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

 In reply to this post by Michael Weylandt R. Michael Weylandt gmail.com> writes: > > I'd write your own diff() that eliminates the method dispatch and > argument checking that diff -> diff.default does. > > x[-1] - x[-len(x)] # is all you really need. > (# you could also try something like c(x[-1], NA) - x which may be > marginally faster as it only subsets x once but you should profile to > find out) > > is probably about as fast as you can get within pure R code (the > function overhead will add a little bit of time as well, so if speed > is truly the only thing that matters, best not to use it. If you wanna > go for even more speed, you'll have to go to compiled code; I'd > suggest inline+Rcpp as the easiest way to do so. That could get it > down to a single pass through the vector in pure C (or nice C++) which > seems to be a lower bound for speed. > > Michael Python has become astonishingly fast during the last years. On an iMAc with 3.06 GHz I can see the following timings (though I do feel a bit suspicious about the timings Python reports):     Python      0.040 s     Version 2.6.1, 1e7 integer elements     Matlab      0.095 s     Matlab's diff function (Version R2011b)     Matlab      0.315 s     Matlab using x(2:N)-x(1:(N-1))     R 2.14.1    0.375 s     R's diff() function     R           0.365 s     R using x[-1]-x[-N]     R           0.270 s     R using c(x[-1],NA)-x)     R+Fortran   0.180 s     R function calling .Fortran     R+C         0.180 s     R function calling .C where---as an example---the C code looks like:     void diff(int *n, int *x, int *d)     { for (long i=0; i<*n-2; i++) d[i] = x[i+1] - x[i]; } There appears to be a factor of 4 between R+compiled code and Python code. It is also interesting to see that in Matlab 'diff' is considerably faster than differencing vectors, while in R it is slower. P. S.:  To make the comparison fair I have used the following Python call:     python -m timeit -n 1 -r 1         -s 'import numpy'         -s 'arr = numpy.random.randint(0, 1000, (10000000,1)).astype("int32")'         'diff = arr[1:] - arr[:-1]' i.e., used 32-bit integers and included the indexing in the loop. > On Fri, Jan 27, 2012 at 7:15 PM, Kevin Ummel gmail.com> wrote: > > Hi everyone, > > > > Speed is the key here. > > > > I need to find the difference between a vector and its one-period lag > > (i.e. the difference between each value and the subsequent one in the > > vector). Let's say the vector contains 10 million random integers > > between 0 and 1,000. The solution vector will have 9,999,999 values, > > since their is no lag for the 1st observation. > > > > In R we have: > > > > #Set up input vector > > x = runif(n=10e6, min=0, max=1000) > > x = round(x) > > > > #Find one-period difference > > y = diff(x) > > > > Question is: How can I get the 'diff(x)' part as fast as absolutely > > possible? I queried some colleagues who work with other languages, and > > they provided equivalent solutions in Python and Clojure that, on their > > machines, appear to be potentially much faster > > (I've put the code below in case anyone is interested). > > However, they mentioned that the overhead in passing the data between > > languages could kill any improvements. I don't have much experience > > integrating other languages, so I'm hoping the community has some ideas > > about how to approach this particular problem... > > > > Many thanks, > > Kevin > > > > In iPython: > > > > In [3]: import numpy as np > > In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16") > > In [5]: arr1 = arr[1:].view() > > In [6]: timeit arr2 = arr1 - arr[:-1] > > 10 loops, best of 3: 20.1 ms per loop > > > > In Clojure: > > > > (defn subtract-lag > >  [n] > >  (let [v (take n (repeatedly rand))] > >    (time (dorun (map - v (cons 0 v)))))) > ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

 In reply to this post by plangfelder Thanks. I've played around with pure R solutions. The fastest re-write of diff (for the 1 lag case) I can seem to find is this: diff2 = function(x) {   y = c(x,NA) - c(NA,x)   y[2:length(x)] } #Compiling via 'cmpfun' doesn't seem to help (or hurt): require(compiler) diff2 = cmpfun(diff2) But that only gets ~10% improvement over default 'diff' on my machine. Still too slow for my particular application. I'm inclined towards Michael's suggestion of inline+Rcpp (or some other use of C under the hood). Could someone show me how to go about doing that? Thanks! Kevin On Jan 28, 2012, at 9:14 AM, Peter Langfelder wrote: > ehm... this doesn't take very many ideas. > > > x = runif(n=10e6, min=0, max=1000) > x = round(x) > > system.time( { >  y = x[-1] - x[-length(x)] > }) > > I get about 0.5 seconds on my old laptop. > > HTH > > Peter > > > On Fri, Jan 27, 2012 at 4:15 PM, Kevin Ummel <[hidden email]> wrote: >> Hi everyone, >> >> Speed is the key here. >> >> I need to find the difference between a vector and its one-period lag (i.e. the difference between each value and the subsequent one in the vector). Let's say the vector contains 10 million random integers between 0 and 1,000. The solution vector will have 9,999,999 values, since their is no lag for the 1st observation. >> >> In R we have: >> >> #Set up input vector >> x = runif(n=10e6, min=0, max=1000) >> x = round(x) >> >> #Find one-period difference >> y = diff(x) >> >> Question is: How can I get the 'diff(x)' part as fast as absolutely possible? I queried some colleagues who work with other languages, and they provided equivalent solutions in Python and Clojure that, on their machines, appear to be potentially much faster (I've put the code below in case anyone is interested). However, they mentioned that the overhead in passing the data between languages could kill any improvements. I don't have much experience integrating other languages, so I'm hoping the community has some ideas about how to approach this particular problem... >> >> Many thanks, >> Kevin >> >> In iPython: >> >> In [3]: import numpy as np >> In [4]: arr = np.random.randint(0, 1000, (10000000,1)).astype("int16") >> In [5]: arr1 = arr[1:].view() >> In [6]: timeit arr2 = arr1 - arr[:-1] >> 10 loops, best of 3: 20.1 ms per loop >> >> In Clojure: >> >> (defn subtract-lag >>  [n] >>  (let [v (take n (repeatedly rand))] >>    (time (dorun (map - v (cons 0 v)))))) >> >> >> >> >> >>        [[alternative HTML version deleted]] >> >> ______________________________________________ >> [hidden email] mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html>> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

Open this post in threaded view
|

## Re: Need very fast application of 'diff' - ideas?

Open this post in threaded view
|