# Is R the right choice for simulating first passage times of random walks?

15 messages
Open this post in threaded view
|

## Is R the right choice for simulating first passage times of random walks?

 Dear R folks, I need to simulate first passage times for iterated partial sums. The related papers are for example [1][2]. As a start I want to simulate how long a simple random walk stays negative, which should result that it behaves like n^(-½). My code looks like this. -------- 8< -------- code -------- >8 -------- n = 100000 # number of simulations length = 100000 # length of iterated sum z = rep(0, times=length + 1) for (i in 1:n) {         x = c(0, sign(rnorm(length)))         s = cumsum(x)         for (i in 1:length) {                 if (s[i] < 0 && s[i + 1] >= 0) {                         z[i] = z[i] + 1                 }         } } plot(1:length(z), z/n) curve(x**(-0.5), add = TRUE) -------- 8< -------- code -------- >8 -------- This code already runs for over half an hour on my system¹. Reading about the for loop [3] it says to try to avoid loops and I probably should use a matrix where every row is a sample. Now my first problem is that there is no matrix equivalent for cumsum(). Can I use matrices to avoid the for loop? My second question is, is R the right choice for such simulations? It would be great when R can also give me a confidence interval(?) and also try to fit a curve through the result and give me the rule of correspondence(?) [4]. Do you have pointers for those? I glanced at simFrame [5] and read ?simulate but could not understand it right away and think that this might be overkill. Do you have any suggestions? Thanks, Paul ¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2,1 GHz [1] http://www-stat.stanford.edu/~amir/preprints/irw.ps[2] http://arxiv.org/abs/0911.5456[3] http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution[4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics) [5] http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Some more skilled folks can help with the curve fitting, but the general answer is yes -- R will handle this quite ably. Consider the following code: <<-------------------------------------->> n = 1e5 length = 1e5 R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. <<---------------------------------------->> There are actually even faster ways to do what you are asking for, but this introduces you to some useful R architecture, above all the apply function. To see how long the longest stretch of negatives in each row is, the following is a little sneaky but works pretty well: countNegative = apply(R,1,function(x){which.max(table(cumsum(x>=0))}) then you can study these random numbers to do whatever with them. The gist of this code is that it counts how many positive number have been seen up to each point: for any stretch this doesn't increase, you must be negative, so this identifies the longest such stretch on each row and records the length. (It may be off by one so check it on a smaller R matrix. An empirical confidence interval might be given by this mu = mean(countNegative) sig = sd(countNegative)/sqrt(length(countNegative)) So all together: <<-------------------------------------->> n = 1e3 length = 1e3 R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. fTemp <- function(x) {     return(max(table(cumsum(x>=0)))) } countNegative = apply(R,1,fTemp) mu = mean(countNegative) sig = sd(countNegative)/sqrt(length(countNegative)) <<---------------------------------------->> This runs pretty fast on my laptop, but you'll need to look into the memory.limit() function if you want to increase the simulation parameters. There are much faster ways to handle the simulation as well, but this should get you off to a nice start with R. Hope this helps, Michael On Wed, Jul 27, 2011 at 7:36 PM, Paul Menzel < [hidden email]> wrote: > Dear R folks, > > > I need to simulate first passage times for iterated partial sums. The > related papers are for example [1][2]. > > As a start I want to simulate how long a simple random walk stays > negative, which should result that it behaves like n^(-½). My code looks > like this. > > -------- 8< -------- code -------- >8 -------- > n = 100000 # number of simulations > > length = 100000 # length of iterated sum > z = rep(0, times=length + 1) > > for (i in 1:n) { >        x = c(0, sign(rnorm(length))) >        s = cumsum(x) >        for (i in 1:length) { >                if (s[i] < 0 && s[i + 1] >= 0) { >                        z[i] = z[i] + 1 >                } >        } > } > plot(1:length(z), z/n) > curve(x**(-0.5), add = TRUE) > -------- 8< -------- code -------- >8 -------- > > This code already runs for over half an hour on my system¹. > > Reading about the for loop [3] it says to try to avoid loops and I > probably should use a matrix where every row is a sample. > > Now my first problem is that there is no matrix equivalent for > cumsum(). Can I use matrices to avoid the for loop? > > My second question is, is R the right choice for such simulations? It > would be great when R can also give me a confidence interval(?) and also > try to fit a curve through the result and give me the rule of > correspondence(?) [4]. Do you have pointers for those? > > I glanced at simFrame [5] and read ?simulate but could not understand > it right away and think that this might be overkill. > > Do you have any suggestions? > > > Thanks, > > Paul > > > ¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2,1 GHz > > > [1] http://www-stat.stanford.edu/~amir/preprints/irw.ps> [2] http://arxiv.org/abs/0911.5456> [3] > http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution> [4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics) > [5] > http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html> > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. > >         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 In reply to this post by Paul Menzel Dear R folks, Am Donnerstag, den 28.07.2011, 01:36 +0200 schrieb Paul Menzel: > I need to simulate first passage times for iterated partial sums. The > related papers are for example [1][2]. > > As a start I want to simulate how long a simple random walk stays > negative, which should result that it behaves like n^(-½). My code looks > like this. > > -------- 8< -------- code -------- >8 -------- > n = 100000 # number of simulations > > length = 100000 # length of iterated sum > z = rep(0, times=length + 1) > > for (i in 1:n) { > x = c(0, sign(rnorm(length))) > s = cumsum(x) > for (i in 1:length) { > if (s[i] < 0 && s[i + 1] >= 0) { > z[i] = z[i] + 1 > } > } > } > plot(1:length(z), z/n) > curve(x**(-0.5), add = TRUE) > -------- 8< -------- code -------- >8 -------- Of course the program above is not complete, because it only checks for the first passage from negativ to positive. if (s[2] < 0) {} should be added before the for loop. > This code already runs for over half an hour on my system¹. > > Reading about the for loop [3] it says to try to avoid loops and I > probably should use a matrix where every row is a sample. > > Now my first problem is that there is no matrix equivalent for > cumsum(). Can I use matrices to avoid the for loop? I mean the inner for loop. Additionally I wonder if cumsum is really faster or if I should sum the elements by myself and check after every step if the walk gets non-negative/0. With a length of 1000000 this should save some cycles. On the other hand adding numbers should be really fast and adding checks in between could potentially be slower. > My second question is, is R the right choice for such simulations? It > would be great when R can also give me a confidence interval(?) and also > try to fit a curve through the result and give me the rule of > correspondence(?) [4]. Do you have pointers for those? > > I glanced at simFrame [5] and read ?simulate but could not understand > it right away and think that this might be overkill. > > Do you have any suggestions? Thanks, Paul > ¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2,1 GHz > > > [1] http://www-stat.stanford.edu/~amir/preprints/irw.ps> [2] http://arxiv.org/abs/0911.5456> [3] http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution> [4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics) > [5] http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Top posting cuz hotmail decided not to highlight... Personally I would tend to use java or c++ for the inner loops but you could of course later make an R package out of that. This is especially true if your code will be used elsewhere in a performance critical system. For example, I wrote some c++ code for dealing with graphs nothing fancy but it let me play with some data structure ideas and I could then build it into standalone programs or perhaps an R package. Many of these things slow down due to memory incoherence or IO long before you use up the processor. With c++ in principle anyway you have a lot of control over these things. Once you have your results and want to anlyze them, that's when I would use R. Dumping simulation samples to a text file is easy to also lets you use other things like sed/grep or vi to explore as needed. ---------------------------------------- From: [hidden email] To: [hidden email] Date: Thu, 28 Jul 2011 02:00:13 +0200 Subject: Re: [R] Is R the right choice for simulating first passage times of random walks? Dear R folks, Am Donnerstag, den 28.07.2011, 01:36 +0200 schrieb Paul Menzel: > I need to simulate first passage times for iterated partial sums. The > related papers are for example [1][2]. > > As a start I want to simulate how long a simple random walk stays > negative, which should result that it behaves like n^(-½). My code looks > like this. > > -------- 8< -------- code -------- >8 -------- > n = 100000 # number of simulations > > length = 100000 # length of iterated sum > z = rep(0, times=length + 1) > > for (i in 1:n) { >       x = c(0, sign(rnorm(length))) >       s = cumsum(x) >       for (i in 1:length) { >               if (s[i] < 0 && s[i + 1] >= 0) { >                       z[i] = z[i] + 1 >               } >       } > } > plot(1:length(z), z/n) > curve(x**(-0.5), add = TRUE) > -------- 8< -------- code -------- >8 -------- Of course the program above is not complete, because it only checks for the first passage from negativ to positive. if (s[2] < 0) {} should be added before the for loop. > This code already runs for over half an hour on my system¹. > > Reading about the for loop [3] it says to try to avoid loops and I > probably should use a matrix where every row is a sample. > > Now my first problem is that there is no matrix equivalent for > cumsum(). Can I use matrices to avoid the for loop? I mean the inner for loop. Additionally I wonder if cumsum is really faster or if I should sum the elements by myself and check after every step if the walk gets non-negative/0. With a length of 1000000 this should save some cycles. On the other hand adding numbers should be really fast and adding checks in between could potentially be slower. > My second question is, is R the right choice for such simulations? It > would be great when R can also give me a confidence interval(?) and also > try to fit a curve through the result and give me the rule of > correspondence(?) [4]. Do you have pointers for those? > > I glanced at simFrame [5] and read ?simulate but could not understand > it right away and think that this might be overkill. > > Do you have any suggestions? Thanks, Paul > ¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2,1 GHz > > > [1] http://www-stat.stanford.edu/~amir/preprints/irw.ps> [2] http://arxiv.org/abs/0911.5456> [3] http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution> [4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics) > [5] http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.       ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 In reply to this post by Paul Menzel You can replace the inner loop   for (i in 1:length) {     if (s[i] < 0 && s[i + 1] >= 0) {       z[i] = z[i] + 1     }   } with the faster   z <- z + (diff(s>=0)==1) (Using the latter forces you fix up the length of z to be one less than you declared -- your loop never touched the last entry in it.) My code relies on the factor that the logicals FALSE and TRUE are converted to integer 0 and 1, respectively, when you do arithmetic on them. E.g., here is your version written as a function, f1, and 2 equivalent ones, f2 and f3, without the inner loop: f1 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {    z = rep(0, times=length) # orig had length+1 but last entry was never used    for (i in 1:n) {            x = c(0, sign(rnorm(length)))            s = cumsum(x)            for (i in 1:length) {                    if (s[i] < 0 && s[i + 1] >= 0) {                            z[i] = z[i] + 1                    }            }    }    z } f2 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {    z = rep(0, times=length)    l1 <- length+1    for (i in 1:n) {            x = c(0, sign(rnorm(length)))            s = cumsum(x)            z <- z + ( s[-l1]<0 & s[-1]>=0 )    }    z } f3 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {    z = rep(0, times=length)    for (i in 1:n) {            x = c(0, sign(rnorm(length)))            s = cumsum(x)            z <- z + (diff(s>=0)==1)    }    z } and here are some timing and correctness tests:   > set.seed(1) ; system.time( z1 <- f1(1000, 1e5) )      user  system elapsed    278.10    0.25  271.44   > set.seed(1) ; system.time( z2 <- f2(1000, 1e5) )      user  system elapsed     37.29    0.84   38.02   > set.seed(1) ; system.time( z3 <- f3(1000, 1e5) )      user  system elapsed     40.11    0.80   40.17   > identical(z1, z2) && identical(z1, z3)   [1] TRUE You might try using sample(c(-1,1), size=length, replace=TRUE) instead of sign(rnorm(length)).  I don't know if there would be any speed difference, but it frees you from the worry that rnorm() might return an exact 0.0. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > -----Original Message----- > From: [hidden email] [mailto:[hidden email]] On Behalf Of Paul Menzel > Sent: Wednesday, July 27, 2011 4:36 PM > To: [hidden email] > Subject: [R] Is R the right choice for simulating first passage times of random walks? > > Dear R folks, > > > I need to simulate first passage times for iterated partial sums. The > related papers are for example [1][2]. > > As a start I want to simulate how long a simple random walk stays > negative, which should result that it behaves like n^(-½). My code looks > like this. > > -------- 8< -------- code -------- >8 -------- > n = 100000 # number of simulations > > length = 100000 # length of iterated sum > z = rep(0, times=length + 1) > > for (i in 1:n) { > x = c(0, sign(rnorm(length))) > s = cumsum(x) > for (i in 1:length) { > if (s[i] < 0 && s[i + 1] >= 0) { > z[i] = z[i] + 1 > } > } > } > plot(1:length(z), z/n) > curve(x**(-0.5), add = TRUE) > -------- 8< -------- code -------- >8 -------- > > This code already runs for over half an hour on my system¹. > > Reading about the for loop [3] it says to try to avoid loops and I > probably should use a matrix where every row is a sample. > > Now my first problem is that there is no matrix equivalent for > cumsum(). Can I use matrices to avoid the for loop? > > My second question is, is R the right choice for such simulations? It > would be great when R can also give me a confidence interval(?) and also > try to fit a curve through the result and give me the rule of > correspondence(?) [4]. Do you have pointers for those? > > I glanced at simFrame [5] and read ?simulate but could not understand > it right away and think that this might be overkill. > > Do you have any suggestions? > > > Thanks, > > Paul > > > ¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2,1 GHz > > > [1] http://www-stat.stanford.edu/~amir/preprints/irw.ps> [2] http://arxiv.org/abs/0911.5456> [3] http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution> [4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics) > [5] http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Methods stopping calculations when a first element is found? (was: Is R the right choice for simulating first passage times of random walks?)

 Dear responders, thank you very much for all your answers, suggestions and corrections. I will try to give feedback and share my findings. Am Donnerstag, den 28.07.2011, 01:56 +0000 schrieb William Dunlap: […] > You might try using sample(c(-1,1), size=length, replace=TRUE) > instead of sign(rnorm(length)).  I don't know if there would > be any speed difference, but it frees you from the worry that > rnorm() might return an exact 0.0. That is a very helpful suggestions. Additionally I was also told to just use integers to reduce the memory usage [1]. First of all I have to admit and apologize, that the example code I provided did count all passing of the x-axis from negativ to positive and did not do what I intended. I only wanted to count how long the random walk stays negative. As a result I can jump out of the for loop, which is the first element in half of the cases for the simple random walk, and now trying to implement the suggested methods using the for loop seems quicker and I wonder if R provides methods to just search for a “first” element. -------- 8< -------- code -------- >8 -------- f1 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {    z = integer(length/2 + 1) # simple random walk only gets 0 on even timings    for (j in 1:n) {            x = sample(c(-1L, 1L), size=length, replace=TRUE)            if (x[1] >= 0 ) {                    z[1] = z[1] + 1L                    next            }            s = cumsum(x)            for (i in 1:length-1) {                    if (s[i] < 0 && s[i + 1] >= 0) {                            z[i/2 + 2] = z[i/2 + 2] + 1L # account for that we get positiv right away, that means a duration of 0                            break                    }            }    }    z } f2 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {    z = integer(length/2 + 1) # simple random walk only gets 0 on even timings    for (i in 1:n) {            x = c(sample(c(-1L, 1L), size=length, replace=TRUE))            if (x[1] >= 0 ) {                    z[1] = z[1] + 1L                    next            }            s = cumsum(x)            minidx = which.max( c(s[-length]<0 & s[-1]>=0, TRUE) ) # The last TRUE is there to fix the case when the random walk stays negative the whole time. \                                                                   # Everything would be FALSE then and which.max() returns 1.            if (minidx != length) {                    z[minidx/2 + 2] <- z[minidx/2 + 2] + 1L # account for that we get positiv right away, that means a duration of 0            }    }    z } -------- 8< -------- code -------- >8 -------- The code results in the following timings. > set.seed(1) ; system.time( z1 <- f1(1000, 1e5) )        User      System verstrichen      13.421       0.300      13.946 > set.seed(1) ; system.time( z2 <- f2(1000, 1e5) )        User      System verstrichen      27.281       0.616      28.010 > identical(z1, z2) As written above, using the for loops we can abort further calculation by jumping out of the loop. The R methods always compare all values and I had to adapt the trick using TRUE and FALSE when the walk stays negative all the time. Can any of you experienced users think of an idea to get rid of the inner loop? I cannot think of quicker matrix operations as done in the Wiki for a use case [2]. I will try to do some timings to get rid of the outer loop using matrices as suggested before. But I have to split it up further to avoid memory issues [1]. Thanks, Paul [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10[2] http://rwiki.sciviews.org/doku.php?id=tips:programming:code_optim2______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 In reply to this post by Michael Weylandt Am Mittwoch, den 27.07.2011, 19:59 -0400 schrieb R. Michael Weylandt : > Some more skilled folks can help with the curve fitting, but the general > answer is yes -- R will handle this quite ably. Great to read that. > Consider the following code: > > <<-------------------------------------->> > n = 1e5 > length = 1e5 > > R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) > R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make > your life INFINITELY better > R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. > > <<---------------------------------------->> > > There are actually even faster ways to do what you are asking for, but this > introduces you to some useful R architecture, above all the apply function. Thank you very much. I realized the the 0 column is not need when summing this up. Additionally I posted the wrong example code and I actually am only interested how long it stays negative from the beginning. > To see how long the longest stretch of negatives in each row is, the > following is a little sneaky but works pretty well: > > countNegative = apply(R,1,function(x){which.max(table(cumsum(x>=0))}) > > then you can study these random numbers to do whatever with them. > > The gist of this code is that it counts how many positive number have been > seen up to each point: for any stretch this doesn't increase, you must be > negative, so this identifies the longest such stretch on each row and > records the length. (It may be off by one so check it on a smaller R matrix. That is a great example. It took me a while what table() does here but thanks to your explanation I finally understood it. […] > So all together: > > <<-------------------------------------->> > n = 1e3 > length = 1e3 > > R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) > R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make > your life INFINITELY better > R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. > fTemp <- function(x) { >     return(max(table(cumsum(x>=0)))) > } > countNegative = apply(R,1,fTemp) > mu = mean(countNegative) > sig = sd(countNegative)/sqrt(length(countNegative)) > > <<---------------------------------------->> > > This runs pretty fast on my laptop, but you'll need to look into the > memory.limit() function if you want to increase the simulation parameters. > There are much faster ways to handle the simulation as well, but this should > get you off to a nice start with R. > > Hope this helps, It did. Thank you again for the kind and elaborate introduction. Trying to run your example right away froze my system using n = 1000 and length = 1e5 [1]. So I really need to be careful how big such a matrix can get. One thing is to use integers as suggested in [2]. My current code looks like the following. -------- 8< -------- code -------- >8 -------- f4 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {         R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)         R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better         fTemp <- function(x) {                 if (x[1] >= 0 ) {                         return(1)                 }                 for (i in 1:length-1) {                         if (x[i] < 0 && x[i + 1] >= 0) {                                 return(as.integer(i/2 + 2)) # simple random walks only hit 0 on even “times”                         }                 }         }         countNegative = apply(R,2,fTemp)         tabulate(as.vector(countNegative), length) } -------- 8< -------- code -------- >8 -------- 1.I could actually avoid cumsum() half the time, when the first entry is already positive. So I am still looking for a way to speed that up in comparison to a simple two loops scenario. 2. The counting of the length how long the walk stayed negative is probably also inefficient and I should find a better way on how to return the values. I am still thinking about both cases, but to come up with vectoriazations of the problem is quite hard. So I welcome any suggestions. ;-) Thanks, Paul [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Hi: See if this works for you: f4 <- function()   {      x <- sample(c(-1L,1L), 1)       if (x >= 0 ) {return(1)} else {            csum <- x            len <- 1                while(csum < 0) {                    csum <- csum + sample(c(-1, 1), 1)                    len <- len + 1                                           }     }       len   } # In one batch of repetitions of this function, system.time(out <- replicate(1000, f4()))    user  system elapsed    0.51    0.00    0.52 > range(out) [1]     1 17372 but in another (untimed), this took a significantly longer amount of time to run [for obvious reasons]: > range(out) [1]      1 987752 For 100000 repetitions, I'd guess this could run anywhere from one to several minutes, depending on the lengths of the sojourns encountered. This looks like a reasonable way to visualize the output for 1000 replications: hist(log(out), nclass = 20) Notice that the function takes no arguments, returns the length of the random walk while its cumulative sum is negative [or 1 if it starts out positive], and then uses the replicate() function to iterate the function f4() N times. HTH, Dennis On Sun, Jul 31, 2011 at 4:36 PM, Paul Menzel <[hidden email]> wrote: > Am Mittwoch, den 27.07.2011, 19:59 -0400 schrieb R. Michael Weylandt : >> Some more skilled folks can help with the curve fitting, but the general >> answer is yes -- R will handle this quite ably. > > Great to read that. > >> Consider the following code: >> >> <<-------------------------------------->> >> n = 1e5 >> length = 1e5 >> >> R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) >> R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make >> your life INFINITELY better >> R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. >> >> <<---------------------------------------->> >> >> There are actually even faster ways to do what you are asking for, but this >> introduces you to some useful R architecture, above all the apply function. > > Thank you very much. I realized the the 0 column is not need when > summing this up. Additionally I posted the wrong example code and I > actually am only interested how long it stays negative from the > beginning. > >> To see how long the longest stretch of negatives in each row is, the >> following is a little sneaky but works pretty well: >> >> countNegative = apply(R,1,function(x){which.max(table(cumsum(x>=0))}) >> >> then you can study these random numbers to do whatever with them. >> >> The gist of this code is that it counts how many positive number have been >> seen up to each point: for any stretch this doesn't increase, you must be >> negative, so this identifies the longest such stretch on each row and >> records the length. (It may be off by one so check it on a smaller R matrix. > > That is a great example. It took me a while what table() does here but > thanks to your explanation I finally understood it. > > […] > >> So all together: >> >> <<-------------------------------------->> >> n = 1e3 >> length = 1e3 >> >> R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) >> R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make >> your life INFINITELY better >> R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. >> fTemp <- function(x) { >>     return(max(table(cumsum(x>=0)))) >> } >> countNegative = apply(R,1,fTemp) >> mu = mean(countNegative) >> sig = sd(countNegative)/sqrt(length(countNegative)) >> >> <<---------------------------------------->> >> >> This runs pretty fast on my laptop, but you'll need to look into the >> memory.limit() function if you want to increase the simulation parameters. >> There are much faster ways to handle the simulation as well, but this should >> get you off to a nice start with R. >> >> Hope this helps, > > It did. Thank you again for the kind and elaborate introduction. > > Trying to run your example right away froze my system using n = 1000 > and length = 1e5 [1]. So I really need to be careful how big such a > matrix can get. One thing is to use integers as suggested in [2]. > > My current code looks like the following. > > -------- 8< -------- code -------- >8 -------- > f4 <- function(n = 100000, # number of simulations >               length = 100000) # length of iterated sum > { >        R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n) >        R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better >        fTemp <- function(x) { >                if (x[1] >= 0 ) { >                        return(1) >                } > >                for (i in 1:length-1) { >                        if (x[i] < 0 && x[i + 1] >= 0) { >                                return(as.integer(i/2 + 2)) # simple random walks only hit 0 on even “times” >                        } >                } >        } >        countNegative = apply(R,2,fTemp) >        tabulate(as.vector(countNegative), length) > } > -------- 8< -------- code -------- >8 -------- > > 1.I could actually avoid cumsum() half the time, when the first entry > is already positive. So I am still looking for a way to speed that up in > comparison to a simple two loops scenario. > 2. The counting of the length how long the walk stayed negative is > probably also inefficient and I should find a better way on how to > return the values. > > I am still thinking about both cases, but to come up with > vectoriazations of the problem is quite hard. > > So I welcome any suggestions. ;-) > > > Thanks, > > Paul > > > [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832> [2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10> > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Hi, I haven't been following this thread very closely, but I'm getting the impression that the "inner loop" that's killing you folks here looks quite simple (assuming it is the one provided below). How about trying to write the of this f4 function below using the rcpp/inline combo. The C/C++ you will need to write looks to be quite trivial, let's change f4 to accept an x argument as a vector: I've defined f4 in the same way as Dennis did: > f4 <- function() >  { >     x <- sample(c(-1L,1L), 1) > >      if (x >= 0 ) {return(1)} else { >           csum <- x >           len <- 1 >               while(csum < 0) { >                   csum <- csum + sample(c(-1, 1), 1) >                   len <- len + 1 >                                          }     } >      len >  } Now, let's do some inline/c++ mojo: library(inline) inc <- " #include #include #include " fxx <-cxxfunction(includes=inc, plugin="Rcpp", body="   int len = 1;   int x = ((rand() % 2 ) == 0) ? 1 : -1;   int csum = x;   while (csum < 0) {     x = ((rand() % 2 ) == 0) ? 1 : -1;     len++;     csum = csum + x;   }   return wrap(len); ") Assuming I've faithfully translated this into c++, the timings aren't all that comparable. Doing 500 replicates with the pure R version: set.seed(123) system.time(out <- replicate(500, f4()))    user  system elapsed  31.525   0.120  32.510 Doing 10,000 replicates using the fxx function doesn't even break a sweat: system.time(outxx <- replicate(10000, fxx()))    user  system elapsed   0.371   0.001   0.373 range(out) [1]       1 1994308 range(outxx) [1]        1 11909394 Hope I'm not too off of the mark, here. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology  | Memorial Sloan-Kettering Cancer Center  | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Hi Steve: Very, very nice. Thanks for the useful Rcpp script. I'm not surprised that a C++ version blows my humble little R function out of the water :) I noticed that the R function ran a lot more slowly when the sojourns were very long. It suggests that algorithms that entail conditional iteration are quite likely to be better off written in a compiled programming language that can communicate with R. It also shows off the capabilities of the Rcpp package. Best regards, Dennis On Sun, Jul 31, 2011 at 8:32 PM, Steve Lianoglou <[hidden email]> wrote: > Hi, > > I haven't been following this thread very closely, but I'm getting the > impression that the "inner loop" that's killing you folks here looks > quite simple (assuming it is the one provided below). > > How about trying to write the of this f4 function below using the > rcpp/inline combo. The C/C++ you will need to write looks to be quite > trivial, let's change f4 to accept an x argument as a vector: > > I've defined f4 in the same way as Dennis did: > >> f4 <- function() >>  { >>     x <- sample(c(-1L,1L), 1) >> >>      if (x >= 0 ) {return(1)} else { >>           csum <- x >>           len <- 1 >>               while(csum < 0) { >>                   csum <- csum + sample(c(-1, 1), 1) >>                   len <- len + 1 >>                                          }     } >>      len >>  } > > Now, let's do some inline/c++ mojo: > > library(inline) > inc <- " > #include > #include > #include > " > > fxx <-cxxfunction(includes=inc, plugin="Rcpp", body=" >  int len = 1; >  int x = ((rand() % 2 ) == 0) ? 1 : -1; >  int csum = x; > >  while (csum < 0) { >    x = ((rand() % 2 ) == 0) ? 1 : -1; >    len++; >    csum = csum + x; >  } > >  return wrap(len); > ") > > Assuming I've faithfully translated this into c++, the timings aren't > all that comparable. > > Doing 500 replicates with the pure R version: > > set.seed(123) > system.time(out <- replicate(500, f4())) >   user  system elapsed >  31.525   0.120  32.510 > > Doing 10,000 replicates using the fxx function doesn't even break a sweat: > > system.time(outxx <- replicate(10000, fxx())) >   user  system elapsed >  0.371   0.001   0.373 > > range(out) > [1]       1 1994308 > > range(outxx) > [1]        1 11909394 > > Hope I'm not too off of the mark, here. > -steve > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology >  | Memorial Sloan-Kettering Cancer Center >  | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact> ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 In reply to this post by Paul Menzel Glad to help -- I haven't taken a look at Dennis' solution (which may be far better than mine), but if you do want to keep going down the path outlined below you might consider the following: Instead of throwing away a simulation if something starts negative, why not just multiply the entire sample by -1: that lets you still use the sample and saves you some computations: of course you'll have to remember to adjust your final results accordingly. This might avoid the loop: x = ## Whatever x is. xLag = c(0,x[-length(x)]) # 'lag' x by 1 step. which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count things, this +1 may be extraneous. The inner expression sets a 0 except where there is a switch from negative to positive and a one there: the which.max function returns the location of the first maximum, which is the first 1, in the vector. If you are guaranteed the run starts negative, then the location of the first positive should give you the length of the negative run. This all gives you, f4 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {        R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) >        R = apply(R,1,cumsum) >                       R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row >        fTemp <- function(x) { >                             xLag = c(0,x[-length(x)])                             return(which.max((x>=0) & (xLag <0))+1) >        countNegative = apply(R,2,fTemp) >        tabulate(as.vector(countNegative), length) > } That just crashed my computer though, so I wouldn't recommend it for large n,length. Instead, you can help a little by combining the lagging and the & all in one. f4 <- function(n = 100000, llength = 100000) {     R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)     R = apply(R,1,cumsum)     R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row     R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0)     countNegative = apply(R,1,which.max) + 1     return (tabulate(as.vector(countNegative), length) ) > } Of course, this is all starting to approach a very specific question that could actually be approached much more efficiently if it's your end goal (though I think I remember from your first email a different end goal): We can use the symmetry and "restart"ability of RW to do the following: x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T) D  = diff(which(x == 0)) This will give you a vector of how long x stays positive or negative at a time. Thinking through some simple translations lets you see that this set has the same distribution as how long a RW that starts negative stays negative. Again, this is only good for answering a very specific question about random walks and may not be useful if you have other more complicated questions in sight. Hope this helps, Michael Weylandt On Sun, Jul 31, 2011 at 6:36 PM, Paul Menzel < [hidden email]> wrote: > Am Mittwoch, den 27.07.2011, 19:59 -0400 schrieb R. Michael Weylandt : > > Some more skilled folks can help with the curve fitting, but the general > > answer is yes -- R will handle this quite ably. > > Great to read that. > > > Consider the following code: > > > > <<-------------------------------------->> > > n = 1e5 > > length = 1e5 > > > > R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) > > R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will > make > > your life INFINITELY better > > R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. > > > > <<---------------------------------------->> > > > > There are actually even faster ways to do what you are asking for, but > this > > introduces you to some useful R architecture, above all the apply > function. > > Thank you very much. I realized the the 0 column is not need when > summing this up. Additionally I posted the wrong example code and I > actually am only interested how long it stays negative from the > beginning. > > > To see how long the longest stretch of negatives in each row is, the > > following is a little sneaky but works pretty well: > > > > countNegative = apply(R,1,function(x){which.max(table(cumsum(x>=0))}) > > > > then you can study these random numbers to do whatever with them. > > > > The gist of this code is that it counts how many positive number have > been > > seen up to each point: for any stretch this doesn't increase, you must be > > negative, so this identifies the longest such stretch on each row and > > records the length. (It may be off by one so check it on a smaller R > matrix. > > That is a great example. It took me a while what table() does here but > thanks to your explanation I finally understood it. > > [] > > > So all together: > > > > <<-------------------------------------->> > > n = 1e3 > > length = 1e3 > > > > R = matrix(sample(c(-1,1),length*n,replace=T),nrow=n) > > R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will > make > > your life INFINITELY better > > R = cbind(rep(0,n),R) ## Now each row is a random walk as you desired. > > fTemp <- function(x) { > >     return(max(table(cumsum(x>=0)))) > > } > > countNegative = apply(R,1,fTemp) > > mu = mean(countNegative) > > sig = sd(countNegative)/sqrt(length(countNegative)) > > > > <<---------------------------------------->> > > > > This runs pretty fast on my laptop, but you'll need to look into the > > memory.limit() function if you want to increase the simulation > parameters. > > There are much faster ways to handle the simulation as well, but this > should > > get you off to a nice start with R. > > > > Hope this helps, > > It did. Thank you again for the kind and elaborate introduction. > > Trying to run your example right away froze my system using n = 1000 > and length = 1e5 [1]. So I really need to be careful how big such a > matrix can get. One thing is to use integers as suggested in [2]. > > My current code looks like the following. > > -------- 8< -------- code -------- >8 -------- > f4 <- function(n = 100000, # number of simulations >                length = 100000) # length of iterated sum > { >        R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n) >         R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and > will make your life INFINITELY better >         fTemp <- function(x) { >                if (x[1] >= 0 ) { >                        return(1) >                } > >                for (i in 1:length-1) { >                        if (x[i] < 0 && x[i + 1] >= 0) { >                                return(as.integer(i/2 + 2)) # simple random > walks only hit 0 on even times >                        } >                } >        } >        countNegative = apply(R,2,fTemp) >        tabulate(as.vector(countNegative), length) > } > -------- 8< -------- code -------- >8 -------- > > 1.I could actually avoid cumsum() half the time, when the first entry > is already positive. So I am still looking for a way to speed that up in > comparison to a simple two loops scenario. > 2. The counting of the length how long the walk stayed negative is > probably also inefficient and I should find a better way on how to > return the values. > > I am still thinking about both cases, but to come up with > vectoriazations of the problem is quite hard. > > So I welcome any suggestions. ;-) > > > Thanks, > > Paul > > > [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832> [2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10>         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 Am Sonntag, den 31.07.2011, 23:32 -0500 schrieb R. Michael Weylandt : > Glad to help -- I haven't taken a look at Dennis' solution (which may be far > better than mine), but if you do want to keep going down the path outlined > below you might consider the following: I will try Dennis’ solution right away but looked at your suggestions first. Thank you very much. > Instead of throwing away a simulation if something starts negative, why not > just multiply the entire sample by -1: that lets you still use the sample > and saves you some computations: of course you'll have to remember to adjust > your final results accordingly. That is a nice suggestion. For a symmetric random walk this is indeed possible and equivalent to looking when the walk first hits zero. > This might avoid the loop: > > x = ## Whatever x is. > xLag = c(0,x[-length(x)]) # 'lag' x by 1 step. > which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count > things, this +1 may be extraneous. > > The inner expression sets a 0 except where there is a switch from negative > to positive and a one there: the which.max function returns the location of > the first maximum, which is the first 1, in the vector. If you are > guaranteed the run starts negative, then the location of the first positive > should give you the length of the negative run. That is the same idea as from Bill [1]. The problem is, when the walk never returns to zero in a sample, which.max(»everything FALSE) returns 1 [2]. That is no problem though, when we do not have to worry about a walk starting with a positive value and adding 1 (+1) can be omitted when we count the epochs of first hitting 0 instead of the time of how long the walk stayed negative, which is always one less. Additionally my check (x>=0) & (xLag <0) is redundant when we know we start with a negative value. (x>=0) should be good enough in this case. > This all gives you, > > f4 <- function(n = 100000, # number of simulations >                length = 100000) # length of iterated sum > { >        R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) > > >        R = apply(R,1,cumsum) > > >                       R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row The line above seems to look the columns instead of rows. I think the following is correct since after the apply() above the random walks are in the columns.         R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] > >        fTemp <- function(x) { > > >                             xLag = c(0,x[-length(x)]) >                             return(which.max((x>=0) & (xLag <0))+1) > > >        countNegative = apply(R,2,fTemp) > >        tabulate(as.vector(countNegative), length) > > } > > That just crashed my computer though, so I wouldn't recommend it for large > n,length. Welcome to my world. I would have never thought that simulating random walks with a length of say a million would create that much data and push common desktop systems with let us say 4 GB of RAM to their limits. > Instead, you can help a little by combining the lagging and the & > all in one. > > f4 <- function(n = 100000, llength = 100000) > { >     R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) >     R = apply(R,1,cumsum) >     R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row >     R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0) >     countNegative = apply(R,1,which.max) + 1 >     return (tabulate(as.vector(countNegative), length) ) > > > } I left that one out, because as written above the check can be shortened. > Of course, this is all starting to approach a very specific question that > could actually be approached much more efficiently if it's your end goal > (though I think I remember from your first email a different end goal): That is true. But to learn some optimization techniques on a simple example is much appreciated and will hopefully help me later on for the iterated random walk cases. > We can use the symmetry and "restart"ability of RW to do the following: > > x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T) > D  = diff(which(x == 0)) Nice! > This will give you a vector of how long x stays positive or negative at a > time. Thinking through some simple translations lets you see that this set > has the same distribution as how long a RW that starts negative stays > negative. I have to write those translations down. On first sight though we need again to handle the case where it stays negative the whole time. D then has length 0 and we have to count that for a walk longer than BIGNUMBER. > Again, this is only good for answering a very specific question > about random walks and may not be useful if you have other more complicated > questions in sight. Just testing for 0 for the iterated cases will not be enough for iterated random walks since an iterated random walk can go from negative to non-negative without being zero at this time/epoch. I implemented all your suggestions and got the following. -------- 8< -------- code -------- >8 -------- f4 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {         R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)         R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better         fTemp <- function(x) {                 if (x[1] >= 0 ) {                         return(1)                 }                 for (i in 1:length-1) {                         if (x[i] < 0 && x[i + 1] >= 0) {                                 return(as.integer(i/2 + 2))                         }                 }         }         countNegative = apply(R,2,fTemp)         tabulate(as.vector(countNegative), length) } f5 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {         R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)         R = apply(R,1,cumsum)         R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the row is positive, flip the entire row         R <- R>=0         countNegative = apply(R,2,which.max)         tabulate(as.vector(countNegative), length) } f6 <- function(n = 100000, # number of simulations                length = 100000) # length of iterated sum {         x = cumsum(sample(c(-1L,1L), length*n,replace=T))         D = diff(which(c(0, x) == 0))         tabulate(D, max(D)) } -------- 8< -------- code -------- >8 -------- The timings differ quite much which is expected though. > # f1 is using only for loops but only does half the calculations > # and does not yet flip random walks starting with a positive value. > set.seed(1) ; system.time( z1 <- f1(300, 1e5) )        User      System verstrichen       2.700       0.008       2.729 > # f1 adapted with flips > set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) )        User      System verstrichen       4.457       0.004       4.475 > set.seed(1) ; system.time( z4 <- f4(300, 1e5) )        User      System verstrichen       8.033       0.380       8.739 > set.seed(1) ; system.time( z5 <- f5(300, 1e5) )        User      System verstrichen       9.640       0.812      10.588 > set.seed(1) ; system.time( z6 <- f6(300, 1e5) )        User      System verstrichen       4.208       0.328       4.606 So f6 seems to be the most efficient setting right now and even is slightly faster than f1 with the for loops. But we have to keep in mind that both operate on different data sets although set.seed(1) is used and f6 treats the problem totally different. One other thought is that when reusing the walks starting with a positiv term and flipping those we can probably also take the backward/reverse walk (dual problem). I will try that too. Thank you very much, Paul [1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html[2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
Open this post in threaded view
|

## Re: Is R the right choice for simulating first passage times of random walks?

 I've only got a 20 minute layover, but three quick remarks: 1) Do a sanity check on your data size: if you want a million walks of a thousand steps, that already gets you to a billion integers to store--even at a very low bound of one byte each, thats already 1GB for the data and you still have to process it all and run the OS. If you bump this to walks of length 10k, you are in big trouble. Considered like that, it shouldn't surprise you that you are getting near memory limits. If you really do need such a large simulation and are willing to make the time/space tradeoff, it may be worth doing simulations in smaller batches (say 50-100) and aggregating the needed stats for analysis. Also, consider direct use of the rm() function for memory management. 2) If you know that which.max()==1 can't happen for your data, might this trick be easier than forcing it through some tricky logic inside the which.max() X=which.max(...) if(X[1]==1) X=Inf # or whatever value 3) I dont have any texts at hand to confirm this but isn't the expected value of the first hit time of a RW infinite? I think a  handwaving proof can be squeezed out of the optional stopping theorem with T=min(T_a,T_b) for a<0 -Inf. If I remember right, this suggests you are trying to calculate a CI for a distribution with no finite moments, a difficult task to say the least. Hope these help and I'll write a more detailed reply to your notes below later, Michael Weylandt PS - what's an iterated RW? This is all outside my field (hence my spitball on #2 above) PS2 - sorry about the row/column mix-up: I usually think of sample paths as rows... On Aug 1, 2011, at 8:49 AM, Paul Menzel <[hidden email]> wrote: > Am Sonntag, den 31.07.2011, 23:32 -0500 schrieb R. Michael Weylandt : >> Glad to help -- I haven't taken a look at Dennis' solution (which may be far >> better than mine), but if you do want to keep going down the path outlined >> below you might consider the following: > > I will try Dennis’ solution right away but looked at your suggestions > first. Thank you very much. > >> Instead of throwing away a simulation if something starts negative, why not >> just multiply the entire sample by -1: that lets you still use the sample >> and saves you some computations: of course you'll have to remember to adjust >> your final results accordingly. > > That is a nice suggestion. For a symmetric random walk this is indeed > possible and equivalent to looking when the walk first hits zero. > >> This might avoid the loop: >> >> x = ## Whatever x is. >> xLag = c(0,x[-length(x)]) # 'lag' x by 1 step. >> which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count >> things, this +1 may be extraneous. >> >> The inner expression sets a 0 except where there is a switch from negative >> to positive and a one there: the which.max function returns the location of >> the first maximum, which is the first 1, in the vector. If you are >> guaranteed the run starts negative, then the location of the first positive >> should give you the length of the negative run. > > That is the same idea as from Bill [1]. The problem is, when the walk > never returns to zero in a sample, which.max(»everything FALSE) > returns 1 [2]. That is no problem though, when we do not have to worry > about a walk starting with a positive value and adding 1 (+1) can be > omitted when we count the epochs of first hitting 0 instead of the time > of how long the walk stayed negative, which is always one less. > > Additionally my check (x>=0) & (xLag <0) is redundant when we know we > start with a negative value. (x>=0) should be good enough in this > case. > >> This all gives you, >> >> f4 <- function(n = 100000, # number of simulations >>               length = 100000) # length of iterated sum >> { >>       R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) >> >>>       R = apply(R,1,cumsum) >>> >>                      R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row > > The line above seems to look the columns instead of rows. I think the > following is correct since after the apply() above the random walks > are in the columns. > >    R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] > >>>       fTemp <- function(x) { >>> >>                            xLag = c(0,x[-length(x)]) >>                            return(which.max((x>=0) & (xLag <0))+1) >> >>>       countNegative = apply(R,2,fTemp) >>>       tabulate(as.vector(countNegative), length) >>> } >> >> That just crashed my computer though, so I wouldn't recommend it for large >> n,length. > > Welcome to my world. I would have never thought that simulating random > walks with a length of say a million would create that much data and > push common desktop systems with let us say 4 GB of RAM to their limits. > >> Instead, you can help a little by combining the lagging and the & >> all in one. >> >> f4 <- function(n = 100000, llength = 100000) >> { >>    R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) >>    R = apply(R,1,cumsum) >>    R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row >>    R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0) >>    countNegative = apply(R,1,which.max) + 1 >>    return (tabulate(as.vector(countNegative), length) ) >> >>> } > > I left that one out, because as written above the check can be > shortened. > >> Of course, this is all starting to approach a very specific question that >> could actually be approached much more efficiently if it's your end goal >> (though I think I remember from your first email a different end goal): > > That is true. But to learn some optimization techniques on a simple > example is much appreciated and will hopefully help me later on for the > iterated random walk cases. > >> We can use the symmetry and "restart"ability of RW to do the following: >> >> x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T) >> D  = diff(which(x == 0)) > > Nice! > >> This will give you a vector of how long x stays positive or negative at a >> time. Thinking through some simple translations lets you see that this set >> has the same distribution as how long a RW that starts negative stays >> negative. > > I have to write those translations down. On first sight though we need > again to handle the case where it stays negative the whole time. D > then has length 0 and we have to count that for a walk longer than > BIGNUMBER. > >> Again, this is only good for answering a very specific question >> about random walks and may not be useful if you have other more complicated >> questions in sight. > > Just testing for 0 for the iterated cases will not be enough for > iterated random walks since an iterated random walk can go from negative > to non-negative without being zero at this time/epoch. > > I implemented all your suggestions and got the following. > > -------- 8< -------- code -------- >8 -------- > f4 <- function(n = 100000, # number of simulations >               length = 100000) # length of iterated sum > { >    R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n) >    R = apply(R,1,cumsum) ## this applies cumsum row-wise' to R and will make your life INFINITELY better >    fTemp <- function(x) { >        if (x[1] >= 0 ) { >            return(1) >        } > >        for (i in 1:length-1) { >            if (x[i] < 0 && x[i + 1] >= 0) { >                return(as.integer(i/2 + 2)) >            } >        } >    } >    countNegative = apply(R,2,fTemp) >    tabulate(as.vector(countNegative), length) > } > > f5 <- function(n = 100000, # number of simulations >               length = 100000) # length of iterated sum > { >    R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n) > >    R = apply(R,1,cumsum) >    R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the row is positive, flip the entire row > >    R <- R>=0 >    countNegative = apply(R,2,which.max) >    tabulate(as.vector(countNegative), length) > } > > f6 <- function(n = 100000, # number of simulations >               length = 100000) # length of iterated sum > { >    x = cumsum(sample(c(-1L,1L), length*n,replace=T)) >    D = diff(which(c(0, x) == 0)) >    tabulate(D, max(D)) > } > -------- 8< -------- code -------- >8 -------- > > The timings differ quite much which is expected though. > >> # f1 is using only for loops but only does half the calculations >> # and does not yet flip random walks starting with a positive value. >> set.seed(1) ; system.time( z1 <- f1(300, 1e5) ) >       User      System verstrichen >      2.700       0.008       2.729 >> # f1 adapted with flips >> set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) ) >       User      System verstrichen >      4.457       0.004       4.475 >> set.seed(1) ; system.time( z4 <- f4(300, 1e5) ) >       User      System verstrichen >      8.033       0.380       8.739 >> set.seed(1) ; system.time( z5 <- f5(300, 1e5) ) >       User      System verstrichen >      9.640       0.812      10.588 >> set.seed(1) ; system.time( z6 <- f6(300, 1e5) ) >       User      System verstrichen >      4.208       0.328       4.606 > > So f6 seems to be the most efficient setting right now and even is > slightly faster than f1 with the for loops. But we have to keep in > mind that both operate on different data sets although set.seed(1) is > used and f6 treats the problem totally different. > > One other thought is that when reusing the walks starting with a positiv > term and flipping those we can probably also take the backward/reverse > walk (dual problem). I will try that too. > > > Thank you very much, > > Paul > > > [1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html> [2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
 Am Montag, den 01.08.2011, 12:43 -0400 schrieb R. Michael Weylandt : > I've only got a 20 minute layover, but three quick remarks: > > 1) Do a sanity check on your data size: if you want a million walks of > a thousand steps, that already gets you to a billion integers to > store--even at a very low bound of one byte each, thats already 1GB > for the data and you still have to process it all and run the OS. If > you bump this to walks of length 10k, you are in big trouble. > > Considered like that, it shouldn't surprise you that you are getting > near memory limits. > > If you really do need such a large simulation and are willing to make > the time/space tradeoff, it may be worth doing simulations in smaller > batches (say 50-100) and aggregating the needed stats for analysis. I already did that, saved the result and ran it again. I also found [1] and will look to do these things in parallel, since the simulations do not depend on each other. I hope I can avoid the matrix then and use just replicate(). > Also, consider direct use of the rm() function for memory management. I was looking for such a feature the last days and read to set the variables to NULL to delete them somewhere. Now I found the correct command. Thank you! > 2) If you know that which.max()==1 can't happen for your data, might > this trick be easier than forcing it through some tricky logic inside > the which.max() > > X=which.max(...) > if(X[1]==1) X=Inf # or whatever value Noted for when I need that again. > 3) I dont have any texts at hand to confirm this but isn't the > expected value of the first hit time of a RW infinite? That is indeed correct. The generating function of the first hitting time of zero T₀ is (g_T₀)(s) ≔ 1/s (1 - \sqrt(1 - s). Therefore (g_T₀)ʹ(s) ≔ 1/s² (1 - \sqrt(1 - s) + 1/s (2(1 - s))^(-½) → ∞ for s → 1. > I think a  handwaving proof can be squeezed out of the optional > stopping theorem with T=min(T_a,T_b) for a<0 -Inf. Apparently there are several ways to prove that. > If I remember right, this suggests you are trying to calculate a CI > for a distribution with no finite moments, a difficult task to say the > least. I do not know. It scares me. ;-) For the symmetric simple random walk S_n ≔ ∑_{i=0}^n X_i I want to verify (1). (1) n^(-½) ~ p_n ≔ P(max_{1 ≤ k ≤ n} S_n < 0) a(x) ~ b(x) means that the quotient converges to 1 for x → ∞. […] > PS - what's an iterated RW? This is all outside my field (hence my > spitball on #2 above) I am sorry, I meant *integrated* random walk [3][4]. Basically that is the integral (“area” – can be negative).         A_n ≔ ∑_{i=0}^n S_i = ∑_{i=0}^n (n - i + 1) X_i Being 0, S₀ and X₀ can always be omitted. So I basically just need to do one cumsum() more over the walks. > PS2 - sorry about the row/column mix-up: I usually think of sample > paths as rows... No problem at all. I already was confused that it looked differently (transposed) after the first apply(). But it made sense. Thanks, Paul [1] http://www.bioconductor.org/help/course-materials/2010/BioC2010/EfficientRProgramming.pdf[2] http://www.steinsaltz.me.uk/probA/ProbALN13.pdf[3] http://www-stat.stanford.edu/~amir/preprints/irw.ps[4] http://arxiv.org/abs/0911.5456______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment
 In reply to this post by Steve Lianoglou-6 Dear Dennis and Steve, Am Sonntag, den 31.07.2011, 23:32 -0400 schrieb Steve Lianoglou: […] > How about trying to write the of this f4 function below using the > rcpp/inline combo. The C/C++ you will need to write looks to be quite > trivial, let's change f4 to accept an x argument as a vector: > > I've defined f4 in the same way as Dennis did: > > > f4 <- function() > >  { > >     x <- sample(c(-1L,1L), 1) > > > >      if (x >= 0 ) {return(1)} else { > >           csum <- x > >           len <- 1 > >               while(csum < 0) { > >                   csum <- csum + sample(c(-1, 1), 1) > >                   len <- len + 1 > >                                          }     } > >      len > >  } > > Now, let's do some inline/c++ mojo: > > library(inline) > inc <- " > #include > #include > #include > " > > fxx <-cxxfunction(includes=inc, plugin="Rcpp", body=" >   int len = 1; >   int x = ((rand() % 2 ) == 0) ? 1 : -1; >   int csum = x; > >   while (csum < 0) { >     x = ((rand() % 2 ) == 0) ? 1 : -1; >     len++; >     csum = csum + x; >   } > >   return wrap(len); > ") > > Assuming I've faithfully translated this into c++, the timings aren't > all that comparable. > > Doing 500 replicates with the pure R version: > > set.seed(123) > system.time(out <- replicate(500, f4())) >    user  system elapsed >  31.525   0.120  32.510 > > Doing 10,000 replicates using the fxx function doesn't even break a sweat: > > system.time(outxx <- replicate(10000, fxx())) >    user  system elapsed >   0.371   0.001   0.373 > > range(out) > [1]       1 1994308 > > range(outxx) > [1]        1 11909394 thank you very much for your suggestions. This is indeed a nice speed. 1. I first had that implemented in FORTRAN (and Python) too, but turned to R for two reasons. First I wanted to use also other distributions later on and thought that it would be easier with R and that R would have that implemented as fast as possible. Secondly I thought that R would also operate faster having the right vectorization and using csum(). But I guess it is difficult to find a good model to use the advantages of R. Especially looking at top when running this example CPU is used 100 % but memory only 40 MB from 2 GB. So if one could use another data structure maybe the calculations could be done on more walks at once. 2. It is indeed possible that the walk never returns to zero, so I should make sure, that I abort the while loop after a certain length. 3. Looking at the data types I am wondering if some integer overflow(?) could happen. I could make the length variable unsigned I suppose [1]. But still csum could go from -len to 0 and for the normal random walk unsigned should not be a problem too besides that the logic/checks have to be adapted. For integrated random walks, ccsum += csum, ccsum would go from -(ccsum**2)/2 up to 0. So later on I should use probably the 64 bit data type (unsigned) long for ccsum, csum and length` to avoid those problems. Memory does not seem to be a problem. Also I need to add an additional check for the height and length in the while loop like the following.         (csum < 0) && (csum > -ULONG_MAX) && (len =< ULONG_MAX) So I came up with the following and to use unsigned I only consider that the random walk stays positive instead of negative. -------- 8< -------- code -------- >8 -------- library(inline) inc <- " #include #include #include #include " f9 <-cxxfunction(includes=inc, plugin="Rcpp", body="   unsigned long len = 1;   if ((rand() % 2 ) == 0) {     return wrap(len);   }   unsigned long x = 1;   for (unsigned long csum = x; csum > 0; csum = ((rand() % 2 ) == 0) ? csum + 1: csum - 1) {     len++;     if ((csum == ULONG_MAX) && (len == ULONG_MAX)) {       return wrap(len);     }   }   return wrap(len); ") -------- 8< -------- code -------- >8 -------- I do not know if the compiler would have optimized it that way anyway and if there is any difference (besides the overflow checks). > set.seed(1); system.time( z9_1 <- replicate(1000, f9()) )        User      System verstrichen       0.076       0.004       0.084 > range(z9_1) [1]       1 1449034 > length(z9_1) [1] 1000 Thanks, Paul [1] https://secure.wikimedia.org/wikipedia/en/wiki/Integer_(computer_science)#Common_integral_data_types______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code. signature.asc (205 bytes) Download Attachment