

Dear R folks,
I need to simulate first passage times for iterated partial sums. The
related papers are for example [1][2].
As a start I want to simulate how long a simple random walk stays
negative, which should behave like n^(½). My code looks like this.
----- 8< ----- code ----- >8 -----
n = 100000      # number of simulations
length = 100000 # length of iterated sum
z = rep(0, times = length + 1)
for (i in 1:n) {
    x = c(0, sign(rnorm(length)))
    s = cumsum(x)
    for (j in 1:length) {
        if (s[j] < 0 && s[j + 1] >= 0) {
            z[j] = z[j] + 1
        }
    }
}
plot(1:length(z), z/n)
curve(x**(0.5), add = TRUE)
----- 8< ----- code ----- >8 -----
This code already runs for over half an hour on my system¹.
Reading about for loops [3], the advice is to avoid them where possible,
so I should probably use a matrix where every row is a sample.
Now my first problem is that there is no matrix equivalent for
`cumsum()`. Can I use matrices to avoid the for loop?
My second question is: is R the right choice for such simulations? It
would be great if R could also give me a confidence interval(?) and
fit a curve through the result, giving me the rule of
correspondence(?) [4]. Do you have pointers for those?
I glanced at simFrame [5] and read `?simulate`, but could not understand
them right away and think they might be overkill.
Do you have any suggestions?
Thanks,
Paul
¹ AMD Athlon(tm) X2 Dual Core Processor BE-2350, 2.1 GHz
[1] http://www-stat.stanford.edu/~amir/preprints/irw.ps
[2] http://arxiv.org/abs/0911.5456
[3] http://cran.r-project.org/doc/manuals/R-intro.html#Repetitive-execution
[4] https://secure.wikimedia.org/wikipedia/en/wiki/Function_(mathematics)
[5] http://finzi.psych.upenn.edu/R/library/simFrame/html/runSimulation.html
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Some more skilled folks can help with the curve fitting, but the general
answer is yes: R will handle this quite ably.
Consider the following code:
<<>>
n = 1e5
length = 1e5
R = matrix(sample(c(-1, 1), length*n, replace = TRUE), nrow = n)
R = t(apply(R, 1, cumsum)) ## this applies cumsum `row-wise' to R (apply() returns the result transposed, hence the t()) and will make your life INFINITELY better
R = cbind(rep(0, n), R)    ## Now each row is a random walk as you desired.
<<>>
There are actually even faster ways to do what you are asking for, but this
introduces you to some useful R architecture, above all the apply function.
To see how long the longest stretch of negatives in each row is, the
following is a little sneaky but works pretty well:
countNegative = apply(R, 1, function(x) max(table(cumsum(x >= 0))))
Then you can study these random numbers to do whatever with them.
The gist of this code is that it counts how many nonnegative numbers have
been seen up to each point: for any stretch where this doesn't increase, you
must be negative, so this identifies the longest such stretch in each row and
records its length. (It may be off by one, so check it on a smaller R matrix.)
An empirical confidence interval might be given by:
mu = mean(countNegative)
sig = sd(countNegative)/sqrt(length(countNegative))
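To turn those two numbers into an actual interval, a normal-approximation sketch (the countNegative values below are made-up stand-ins for the simulated lengths):

```r
## 95% normal-approximation confidence interval for the mean sojourn length.
countNegative <- c(3, 1, 7, 5, 3, 9, 1, 3)  # made-up stand-in data
mu  <- mean(countNegative)
sig <- sd(countNegative) / sqrt(length(countNegative))
c(lower = mu - 1.96 * sig, upper = mu + 1.96 * sig)
```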
So all together:
<<>>
n = 1e3
length = 1e3
R = matrix(sample(c(-1, 1), length*n, replace = TRUE), nrow = n)
R = t(apply(R, 1, cumsum)) ## this applies cumsum `row-wise' to R (apply() returns the result transposed, hence the t()) and will make your life INFINITELY better
R = cbind(rep(0, n), R)    ## Now each row is a random walk as you desired.
fTemp <- function(x) {
    return(max(table(cumsum(x >= 0))))
}
countNegative = apply(R, 1, fTemp)
mu = mean(countNegative)
sig = sd(countNegative)/sqrt(length(countNegative))
<<>>
This runs pretty fast on my laptop, but you'll need to look into the
memory.limit() function if you want to increase the simulation parameters.
There are much faster ways to handle the simulation as well, but this should
get you off to a nice start with R.
Hope this helps,
Michael
On Wed, Jul 27, 2011 at 7:36 PM, Paul Menzel <
[hidden email]> wrote:
> […]


Dear R folks,
On Thursday, 2011-07-28 at 01:36 +0200, Paul Menzel wrote:
> […]
Of course the program above is not complete, because it only checks for
the first passage from negative to positive. `if (s[2] < 0) {}` should be
added before the for loop.
> […]
> Now my first problem is that there is no matrix equivalent for
> `cumsum()`. Can I use matrices to avoid the for loop?
I mean the inner for loop.
Additionally I wonder if `cumsum` is really faster, or if I should sum
the elements myself and check after every step whether the walk becomes
nonnegative. With a length of 1000000 this should save some cycles.
On the other hand, adding numbers should be really fast, and adding checks
in between could potentially be slower.
> […]
Thanks,
Paul


Top posting cuz hotmail decided not to highlight...
Personally I would tend to use Java or C++ for the inner loops,
but you could of course later make an R package out of that.
This is especially true if your code will be used elsewhere
in a performance-critical system. For example, I wrote some
C++ code for dealing with graphs; nothing fancy, but it
let me play with some data structure ideas, and I could
then build it into standalone programs or perhaps an R
package.
Many of these things slow down due to memory incoherence
or IO long before you use up the processor. With C++,
in principle anyway, you have a lot of control over these things.
Once you have your results and want to analyze them, that's
when I would use R. Dumping simulation samples to a text file
is easy and also lets you use other things like sed/grep or
vi to explore as needed.

From: [hidden email]
To: [hidden email]
Date: Thu, 28 Jul 2011 02:00:13 +0200
Subject: Re: [R] Is R the right choice for simulating first passage times of random walks?
[…]


You can replace the inner loop
    for (i in 1:length) {
        if (s[i] < 0 && s[i + 1] >= 0) {
            z[i] = z[i] + 1
        }
    }
with the faster
    z <- z + (diff(s >= 0) == 1)
(Using the latter forces you to fix up the length of
z to be one less than you declared; your loop never
touched the last entry in it.)
My code relies on the fact that the logicals FALSE
and TRUE are converted to integer 0 and 1, respectively,
when you do arithmetic on them.
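A quick illustration of that coercion, with a made-up s (just a sketch):

```r
## TRUE/FALSE act as 1/0 in arithmetic, so diff(s >= 0) is +1 exactly
## where the walk steps from negative to nonnegative.
s <- c(0, -1, -2, -1, 0, 1, 0, -1, 0)
diff(s >= 0)              # -1 0 0 1 0 0 -1 1
which(diff(s >= 0) == 1)  # 4 8: s[4] < 0 & s[5] >= 0, and s[8] < 0 & s[9] >= 0
```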
E.g., here is your version written as a function, f1, and
two equivalent ones, f2 and f3, without the inner loop:
f1 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    z = rep(0, times = length) # orig had length+1 but last entry was never used
    for (i in 1:n) {
        x = c(0, sign(rnorm(length)))
        s = cumsum(x)
        for (i in 1:length) {
            if (s[i] < 0 && s[i + 1] >= 0) {
                z[i] = z[i] + 1
            }
        }
    }
    z
}
f2 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    z = rep(0, times = length)
    l1 <- length + 1
    for (i in 1:n) {
        x = c(0, sign(rnorm(length)))
        s = cumsum(x)
        z <- z + (s[-l1] < 0 & s[-1] >= 0)
    }
    z
}
f3 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    z = rep(0, times = length)
    for (i in 1:n) {
        x = c(0, sign(rnorm(length)))
        s = cumsum(x)
        z <- z + (diff(s >= 0) == 1)
    }
    z
}
and here are some timing and correctness tests:
> set.seed(1) ; system.time( z1 <- f1(1000, 1e5) )
   user  system elapsed
 278.10    0.25  271.44
> set.seed(1) ; system.time( z2 <- f2(1000, 1e5) )
   user  system elapsed
  37.29    0.84   38.02
> set.seed(1) ; system.time( z3 <- f3(1000, 1e5) )
   user  system elapsed
  40.11    0.80   40.17
> identical(z1, z2) && identical(z1, z3)
[1] TRUE
You might try using sample(c(-1, 1), size=length, replace=TRUE)
instead of sign(rnorm(length)). I don't know if there would
be any speed difference, but it frees you from the worry that
rnorm() might return an exact 0.0.
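If you want to check both points yourself, a quick sketch (timings will of course vary by machine):

```r
## Compare the two step generators; sample() can never produce a 0 step,
## while sign(rnorm(n)) could, in principle, if rnorm() returned exactly 0.
n <- 1e6
system.time(x1 <- sign(rnorm(n)))
system.time(x2 <- sample(c(-1L, 1L), n, replace = TRUE))
any(x2 == 0)  # FALSE by construction
```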
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: [hidden email] [mailto: [hidden email]] On Behalf Of Paul Menzel
> Sent: Wednesday, July 27, 2011 4:36 PM
> To: [hidden email]
> Subject: [R] Is R the right choice for simulating first passage times of random walks?
> […]


Dear responders,
thank you very much for all your answers, suggestions and corrections. I
will try to give feedback and share my findings.
On Thursday, 2011-07-28 at 01:56 +0000, William Dunlap wrote:
[…]
> You might try using sample(c(-1, 1), size=length, replace=TRUE)
> instead of sign(rnorm(length)). I don't know if there would
> be any speed difference, but it frees you from the worry that
> rnorm() might return an exact 0.0.
That is a very helpful suggestion. Additionally I was also told to just
use integers to reduce the memory usage [1].
First of all I have to admit and apologize that the example code I
provided counted every passage of the x-axis from negative to positive
and did not do what I intended. I only wanted to count how long the
random walk stays negative. As a result I can jump out of the for loop
(which happens at the first element in half of the cases for the simple
random walk), and now, implementing the suggested methods, the for
loop seems quicker, and I wonder if R provides methods to just search for
a “first” element.
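One candidate I have found for such a “first” search (an untested sketch of my own): which.max() on a logical vector returns the position of the first TRUE, since TRUE counts as 1.

```r
## which.max() on a logical vector gives the index of the first TRUE,
## i.e. the first time the walk is nonnegative; guard the case where
## the walk stays negative throughout.
s <- cumsum(c(-1, -1, 1, -1, 1, 1, 1))  # -1 -2 -1 -2 -1 0 1
first_nonneg <- if (any(s >= 0)) which.max(s >= 0) else NA_integer_
first_nonneg  # 6
```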
----- 8< ----- code ----- >8 -----
f1 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    z = integer(length/2 + 1) # a simple random walk only returns to 0 at even times
    for (j in 1:n) {
        x = sample(c(-1L, 1L), size = length, replace = TRUE)
        if (x[1] >= 0) {
            z[1] = z[1] + 1L
            next
        }
        s = cumsum(x)
        for (i in 1:(length - 1)) {
            if (s[i] < 0 && s[i + 1] >= 0) {
                z[i/2 + 2] = z[i/2 + 2] + 1L # account for getting positive right away, which means a duration of 0
                break
            }
        }
    }
    z
}
f2 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    z = integer(length/2 + 1) # a simple random walk only returns to 0 at even times
    for (i in 1:n) {
        x = sample(c(-1L, 1L), size = length, replace = TRUE)
        if (x[1] >= 0) {
            z[1] = z[1] + 1L
            next
        }
        s = cumsum(x)
        minidx = which.max(c(s[-length] < 0 & s[-1] >= 0, TRUE))
        # The trailing TRUE fixes the case when the random walk stays negative
        # the whole time: without it everything would be FALSE and which.max()
        # would return 1.
        if (minidx != length) {
            z[minidx/2 + 2] <- z[minidx/2 + 2] + 1L # account for getting positive right away, which means a duration of 0
        }
    }
    z
}
----- 8< ----- code ----- >8 -----
The code results in the following timings.
> set.seed(1) ; system.time( z1 <- f1(1000, 1e5) )
   user  system elapsed
 13.421   0.300  13.946
> set.seed(1) ; system.time( z2 <- f2(1000, 1e5) )
   user  system elapsed
 27.281   0.616  28.010
> identical(z1, z2)
As written above, using the for loops we can abort further calculation
by jumping out of the loop. The R methods always compare all values,
and I had to adapt the trick using TRUE and FALSE for when the walk
stays negative the whole time.
Can any of you experienced users think of an idea to get rid of the
inner loop? I cannot think of quicker matrix operations than those
done in the Wiki for a similar use case [2].
I will try to do some timings for getting rid of the outer loop using
matrices as suggested before. But I will have to split the work up
further to avoid memory issues [1].
Thanks,
Paul
[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10
[2] http://rwiki.sciviews.org/doku.php?id=tips:programming:code_optim2


On Wednesday, 2011-07-27 at 19:59 -0400, R. Michael Weylandt wrote:
> Some more skilled folks can help with the curve fitting, but the general
> answer is yes: R will handle this quite ably.
Great to read that.
> Consider the following code:
>
> <<>>
> n = 1e5
> length = 1e5
>
> R = matrix(sample(c(-1, 1), length*n, replace = TRUE), nrow = n)
> R = t(apply(R, 1, cumsum)) ## this applies cumsum `row-wise' to R (apply() returns the result transposed, hence the t()) and will make your life INFINITELY better
>
> <<>>
>
> There are actually even faster ways to do what you are asking for, but this
> introduces you to some useful R architecture, above all the apply function.
Thank you very much. I realized that the 0 column is not needed when
summing this up. Additionally I posted the wrong example code; I am
actually only interested in how long it stays negative from the
beginning.
> To see how long the longest stretch of negatives in each row is, the
> following is a little sneaky but works pretty well:
>
> countNegative = apply(R, 1, function(x) max(table(cumsum(x >= 0))))
>
> then you can study these random numbers to do whatever with them.
>
> The gist of this code is that it counts how many nonnegative numbers have been
> seen up to each point: for any stretch where this doesn't increase, you must be
> negative, so this identifies the longest such stretch in each row and
> records its length. (It may be off by one, so check it on a smaller R matrix.)
That is a great example. It took me a while to see what `table()` does
here, but thanks to your explanation I finally understood it.
[…]
> So all together:
>
> <<>>
> n = 1e3
> length = 1e3
>
> R = matrix(sample(c(-1, 1), length*n, replace = TRUE), nrow = n)
> R = t(apply(R, 1, cumsum)) ## this applies cumsum `row-wise' to R (apply() returns the result transposed, hence the t()) and will make your life INFINITELY better
> R = cbind(rep(0, n), R)    ## Now each row is a random walk as you desired.
> fTemp <- function(x) {
>     return(max(table(cumsum(x >= 0))))
> }
> countNegative = apply(R,1,fTemp)
> mu = mean(countNegative)
> sig = sd(countNegative)/sqrt(length(countNegative))
>
> <<>>
>
> This runs pretty fast on my laptop, but you'll need to look into the
> memory.limit() function if you want to increase the simulation parameters.
> There are much faster ways to handle the simulation as well, but this should
> get you off to a nice start with R.
>
> Hope this helps,
It did. Thank you again for the kind and elaborate introduction.
Trying to run your example right away froze my system with `n = 1000`
and `length = 1e5` [1]. So I really need to be careful about how big such
a matrix can get. One thing is to use integers, as suggested in [2].
My current code looks like the following.
----- 8< ----- code ----- >8 -----
f4 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    R = matrix(sample(c(-1L, 1L), length*n, replace = TRUE), nrow = n)
    R = apply(R, 1, cumsum) ## applies cumsum `row-wise'; note that apply() returns the walks as *columns*
    fTemp <- function(x) {
        if (x[1] >= 0) {
            return(1)
        }
        for (i in 1:(length - 1)) {
            if (x[i] < 0 && x[i + 1] >= 0) {
                return(as.integer(i/2 + 2)) # simple random walks only hit 0 at even “times”
            }
        }
        NA_integer_ # the walk stayed negative the whole time
    }
    countNegative = apply(R, 2, fTemp) ## margin 2 because of the transposition above
    tabulate(as.vector(countNegative), length)
}
----- 8< ----- code ----- >8 -----
1. I could actually avoid `cumsum()` half of the time, when the first
entry is already positive. So I am still looking for a way to speed that
up compared to a simple two-loop scenario.
2. The counting of how long the walk stayed negative is probably also
inefficient, and I should find a better way to return the values.
I am still thinking about both cases, but coming up with vectorizations
of the problem is quite hard.
So I welcome any suggestions. ;)
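In the meantime, here is one loop-free idea I may try (an untested sketch of my own; the parameters are deliberately small, and max.col() with ties.method = "first" picks the first nonnegative position in each row):

```r
## Each row is one walk; (S >= 0) + 0 turns the logical matrix numeric so
## max.col() returns, per row, the column of the first maximum, i.e. the
## first position where the walk is nonnegative.
n <- 100; len <- 100
steps <- matrix(sample(c(-1L, 1L), n * len, replace = TRUE), nrow = n)
S <- t(apply(steps, 1, cumsum))                      # undo apply()'s transpose
firstNonneg <- max.col((S >= 0) + 0, ties.method = "first")
firstNonneg[rowSums(S >= 0) == 0] <- NA              # walks that never recover
```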
Thanks,
Paul
[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10


Hi:
See if this works for you:
f4 <- function()
{
    x <- sample(c(-1L, 1L), 1)
    if (x >= 0) { return(1) } else {
        csum <- x
        len <- 1
        while (csum < 0) {
            csum <- csum + sample(c(-1, 1), 1)
            len <- len + 1
        }
    }
    len
}
# In one batch of repetitions of this function,
system.time(out <- replicate(1000, f4()))
   user  system elapsed
   0.51    0.00    0.52
> range(out)
[1] 1 17372
but in another (untimed), this took a significantly longer amount of
time to run [for obvious reasons]:
> range(out)
[1] 1 987752
For 100000 repetitions, I'd guess this could run anywhere from one to
several minutes, depending on the lengths of the sojourns encountered.
This looks like a reasonable way to visualize the output for 1000
replications:
hist(log(out), nclass = 20)
Notice that the function takes no arguments and returns the length of
the random walk while its cumulative sum is negative [or 1 if it starts
out positive]; replicate() is then used to iterate the function f4() N
times.
HTH,
Dennis
On Sun, Jul 31, 2011 at 4:36 PM, Paul Menzel
< [hidden email]> wrote:
> […]


Hi,
I haven't been following this thread very closely, but I'm getting the
impression that the "inner loop" that's killing you folks here looks
quite simple (assuming it is the one provided below).
How about writing the core of this `f4` function below using the
Rcpp/inline combo? The C/C++ you will need to write looks to be quite
trivial; one could even change f4 to accept an x argument as a vector.
I've defined f4 in the same way as Dennis did:
> f4 <- function()
> {
>     x <- sample(c(-1L, 1L), 1)
>     if (x >= 0) { return(1) } else {
>         csum <- x
>         len <- 1
>         while (csum < 0) {
>             csum <- csum + sample(c(-1, 1), 1)
>             len <- len + 1
>         }
>     }
>     len
> }
Now, let's do some inline/C++ mojo:
library(inline)
inc <- "
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
"
fxx <- cxxfunction(includes = inc, plugin = "Rcpp", body = "
    // NB: rand() bypasses R's RNG, so set.seed() has no effect on it
    int len = 1;
    int x = ((rand() % 2) == 0) ? 1 : -1;
    int csum = x;
    while (csum < 0) {
        x = ((rand() % 2) == 0) ? 1 : -1;
        len++;
        csum = csum + x;
    }
    return wrap(len);
")
Assuming I've faithfully translated this into c++, the timings aren't
all that comparable.
Doing 500 replicates with the pure R version:
set.seed(123)
system.time(out <- replicate(500, f4()))
   user  system elapsed
 31.525   0.120  32.510
Doing 10,000 replicates using the fxx function doesn't even break a sweat:
system.time(outxx <- replicate(10000, fxx()))
   user  system elapsed
  0.371   0.001   0.373
range(out)
[1] 1 1994308
range(outxx)
[1] 1 11909394
Hope I'm not too off of the mark, here.
steve

Steve Lianoglou
Graduate Student: Computational Systems Biology
 Memorial SloanKettering Cancer Center
 Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Hi Steve:
Very, very nice. Thanks for the useful Rcpp script. I'm not surprised
that a C++ version blows my humble little R function out of the water
:) I noticed that the R function ran a lot more slowly when the
sojourns were very long. It suggests that algorithms that entail
conditional iteration are quite likely to be better off written in a
compiled programming language that can communicate with R. It also
shows off the capabilities of the Rcpp package.
Best regards,
Dennis
On Sun, Jul 31, 2011 at 8:32 PM, Steve Lianoglou
< [hidden email]> wrote:
> […]


Glad to help; I haven't taken a look at Dennis' solution (which may be far
better than mine), but if you do want to keep going down the path outlined
below you might consider the following:
Instead of throwing away a simulation if something starts negative, why not
just multiply the entire sample by -1: that lets you still use the sample
and saves you some computations; of course you'll have to remember to adjust
your final results accordingly.
This might avoid the loop:
x = ## Whatever x is.
xLag = c(0, x[-length(x)]) # 'lag' x by 1 step.
which.max((x >= 0) & (xLag < 0)) + 1 # Depending on how you've decided to
# count things, this +1 may be extraneous.
The inner expression is 0 except where there is a switch from negative
to positive, where it is 1; the which.max function returns the location of
the first maximum, which is the first 1, in the vector. If you are
guaranteed the run starts negative, then the location of the first positive
should give you the length of the negative run. This all gives you,
f4 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
    R = apply(R, 1, cumsum)
    R[R[,1] == 1, ] = -1 * R[R[,1] == 1, ] # If the first
    # element in the row is positive, flip the entire row
    fTemp <- function(x) {
        xLag = c(0, x[-length(x)])
        return(which.max((x >= 0) & (xLag < 0)) + 1)
    }
    countNegative = apply(R, 2, fTemp)
    tabulate(as.vector(countNegative), length)
}
That just crashed my computer though, so I wouldn't recommend it for large
n and length. Instead, you can help a little by combining the lagging and
the & all in one.
f4 <- function(n = 100000, length = 100000)
{
    R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
    R = apply(R, 1, cumsum)
    R[R[,1] == 1, ] = -1 * R[R[,1] == 1, ] # If the first element in the row
    # is positive, flip the entire row
    R = (cbind(rep(0,NROW(R)), R) < 0) & (cbind(R, rep(0,NROW(R))) >= 0)
    countNegative = apply(R, 1, which.max) + 1
    return(tabulate(as.vector(countNegative), length))
}
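The lag-and-compare expression is easy to sanity-check against a direct scan. A standalone C++ sketch (a hypothetical helper of our own naming, not part of the R code above):

```cpp
#include <cstddef>
#include <vector>

// First 1-based index at which the walk switches from negative to
// non-negative, i.e. which.max((x >= 0) & (xLag < 0)) written as a
// direct scan. Returns 0 when no switch occurs (the all-FALSE case
// that R's which.max() silently maps to index 1).
std::size_t first_switch(const std::vector<int>& x) {
    int prev = 0;                    // xLag is padded with a leading 0
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] >= 0 && prev < 0) return i + 1;
        prev = x[i];
    }
    return 0;
}
```

The explicit 0 return for "no switch" is exactly the corner case discussed later in this thread for walks that never come back to zero.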
Of course, this is all starting to approach a very specific question that
could actually be approached much more efficiently if it's your end goal
(though I think I remember from your first email a different end goal):
We can use the symmetry and "restart"ability of RW to do the following:
x = cumsum(sample(c(1L, -1L), BIGNUMBER, replace=T))
D = diff(which(x == 0))
This will give you a vector of how long x stays positive or negative at a
time. Thinking through some simple translations lets you see that this set
has the same distribution as how long a RW that starts negative stays
negative. Again, this is only good for answering a very specific question
about random walks and may not be useful if you have other more complicated
questions in sight.
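The restart trick can be sketched outside R as well (a standalone C++ sketch with names of our own choosing):

```cpp
#include <cstddef>
#include <vector>

// Excursion lengths of a simple random walk: gaps between successive
// returns to 0, mirroring D = diff(which(x == 0)) with the walk's
// starting 0 prepended. A final unfinished excursion is dropped,
// which is the censoring issue for walks that never come back.
std::vector<long> excursion_lengths(const std::vector<int>& steps) {
    std::vector<long> lengths;
    long csum = 0;
    long last_zero = 0;              // index of the most recent zero
    for (std::size_t i = 0; i < steps.size(); ++i) {
        csum += steps[i];
        if (csum == 0) {
            lengths.push_back(static_cast<long>(i) + 1 - last_zero);
            last_zero = static_cast<long>(i) + 1;
        }
    }
    return lengths;
}
```

For the step sequence -1, 1, -1, -1, 1, 1 this yields excursions of length 2 and 4, matching what `diff(which(...))` gives on the cumulative sums.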
Hope this helps,
Michael Weylandt
On Sun, Jul 31, 2011 at 6:36 PM, Paul Menzel <
[hidden email]> wrote:
> On Wednesday, 27.07.2011, at 19:59 -0400, R. Michael Weylandt wrote:
> > Some more skilled folks can help with the curve fitting, but the general
> > answer is yes  R will handle this quite ably.
>
> Great to read that.
>
> > Consider the following code:
> >
> > <<>>
> > n = 1e5
> > length = 1e5
> >
> > R = matrix(sample(c(1,-1), length*n, replace=T), nrow=n)
> > R = apply(R, 1, cumsum) ## this applies cumsum `row-wise' to R and will
> > make your life INFINITELY better
> > R = cbind(rep(0,n), R) ## Now each row is a random walk as you desired.
> >
> > <<>>
> >
> > There are actually even faster ways to do what you are asking for, but
> this
> > introduces you to some useful R architecture, above all the apply
> function.
>
> Thank you very much. I realized that the 0 column is not needed when
> summing this up. Additionally I posted the wrong example code and I
> actually am only interested in how long it stays negative from the
> beginning.
>
> > To see how long the longest stretch of negatives in each row is, the
> > following is a little sneaky but works pretty well:
> >
> > countNegative = apply(R, 1, function(x) { which.max(table(cumsum(x>=0))) })
> >
> > then you can study these random numbers to do whatever with them.
> >
> > The gist of this code is that it counts how many positive numbers have
> > been seen up to each point: for any stretch this doesn't increase, you
> > must be negative, so this identifies the longest such stretch on each
> > row and records the length. (It may be off by one so check it on a
> > smaller R matrix.)
>
> That is a great example. It took me a while to see what `table()` does here
> but thanks to your explanation I finally understood it.
>
> […]
>
> > So all together:
> >
> > <<>>
> > n = 1e3
> > length = 1e3
> >
> > R = matrix(sample(c(1,-1), length*n, replace=T), nrow=n)
> > R = apply(R, 1, cumsum) ## this applies cumsum `row-wise' to R and will
> > make your life INFINITELY better
> > R = cbind(rep(0,n), R) ## Now each row is a random walk as you desired.
> > fTemp <- function(x) {
> >   return(max(table(cumsum(x >= 0))))
> > }
> > countNegative = apply(R,1,fTemp)
> > mu = mean(countNegative)
> > sig = sd(countNegative)/sqrt(length(countNegative))
> >
> > <<>>
> >
> > This runs pretty fast on my laptop, but you'll need to look into the
> > memory.limit() function if you want to increase the simulation
> parameters.
> > There are much faster ways to handle the simulation as well, but this
> should
> > get you off to a nice start with R.
> >
> > Hope this helps,
>
> It did. Thank you again for the kind and elaborate introduction.
>
> Trying to run your example right away froze my system using `n = 1000`
> and `length = 1e5` [1]. So I really need to be careful how big such a
> matrix can get. One thing is to use integers as suggested in [2].
>
> My current code looks like the following.
>
>  8< --- code --- >8 ---
> f4 <- function(n = 100000,      # number of simulations
>                length = 100000) # length of iterated sum
> {
>     R = matrix(sample(c(1L,-1L), length*n, replace=T), nrow=n)
>     R = apply(R, 1, cumsum) ## this applies cumsum `row-wise' to R and
>     ## will make your life INFINITELY better
>     fTemp <- function(x) {
>         if (x[1] >= 0) {
>             return(1)
>         }
>
>         for (i in 1:(length-1)) {
>             if (x[i] < 0 && x[i + 1] >= 0) {
>                 return(as.integer(i/2 + 2)) # simple random
>                 # walks only hit 0 on even times
>             }
>         }
>     }
>     countNegative = apply(R, 2, fTemp)
>     tabulate(as.vector(countNegative), length)
> }
>  8< --- code --- >8 ---
>
> 1. I could actually avoid `cumsum()` half the time, when the first entry
> is already positive. So I am still looking for a way to speed that up in
> comparison to a simple two-loop scenario.
> 2. The counting of the length how long the walk stayed negative is
> probably also inefficient and I should find a better way on how to
> return the values.
>
> I am still thinking about both cases, but coming up with
> vectorizations of the problem is quite hard.
>
> So I welcome any suggestions. ;)
>
>
> Thanks,
>
> Paul
>
>
> [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832
> [2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635832#10


On Sunday, 31.07.2011, at 23:32 -0500, R. Michael Weylandt wrote:
> Glad to help  I haven't taken a look at Dennis' solution (which may be far
> better than mine), but if you do want to keep going down the path outlined
> below you might consider the following:
I will try Dennis’ solution right away but looked at your suggestions
first. Thank you very much.
> Instead of throwing away a simulation if something starts negative, why not
> just multiply the entire sample by -1: that lets you still use the sample
> and saves you some computations: of course you'll have to remember to adjust
> your final results accordingly.
That is a nice suggestion. For a symmetric random walk this is indeed
possible and equivalent to looking when the walk first hits zero.
> This might avoid the loop:
>
> x = ## Whatever x is.
> xLag = c(0, x[-length(x)]) # 'lag' x by 1 step.
> which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count
> things, this +1 may be extraneous.
>
> The inner expression is 0 except where there is a switch from negative
> to positive, where it is 1: the which.max function returns the location of
> the first maximum, which is the first 1, in the vector. If you are
> guaranteed the run starts negative, then the location of the first positive
> should give you the length of the negative run.
That is the same idea as from Bill [1]. The problem is, when the walk
never returns to zero in a sample, `which.max()` of an all-FALSE vector
returns 1 [2]. That is no problem though, when we do not have to worry
about a walk starting with a positive value, and adding 1 (+1) can be
omitted when we count the epochs of first hitting 0 instead of the time
of how long the walk stayed negative, which is always one less.
Additionally my check `(x>=0) & (xLag <0)` is redundant when we know we
start with a negative value. `(x>=0)` should be good enough in this
case.
> This all gives you,
>
> f4 < function(n = 100000, # number of simulations
> length = 100000) # length of iterated sum
> {
> R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
>
> > R = apply(R,1,cumsum)
> >
> R[R[,1]==1,] = -1 * R[R[,1]==1,] # If the first element in the row is positive, flip the entire row
The line above seems to address the rows instead of the columns. I think the
following is correct since after the `apply()` above the random walks
are in the columns.
R[, R[1,]==1] = -1 * R[, R[1,]==1]
> > fTemp < function(x) {
> >
> xLag = c(0, x[-length(x)])
> return(which.max((x>=0) & (xLag <0))+1)
>
> > countNegative = apply(R,2,fTemp)
> > tabulate(as.vector(countNegative), length)
> > }
>
> That just crashed my computer though, so I wouldn't recommend it for large
> n,length.
Welcome to my world. I would have never thought that simulating random
walks with a length of say a million would create that much data and
push common desktop systems with let us say 4 GB of RAM to their limits.
> Instead, you can help a little by combining the lagging and the &
> all in one.
>
> f4 <- function(n = 100000, length = 100000)
> {
>     R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
>     R = apply(R, 1, cumsum)
>     R[R[,1]==1,] = -1 * R[R[,1]==1,] # If the first element in the row is positive, flip the entire row
>     R = (cbind(rep(0,NROW(R)), R) < 0) & (cbind(R, rep(0,NROW(R))) >= 0)
>     countNegative = apply(R, 1, which.max) + 1
>     return(tabulate(as.vector(countNegative), length))
> }
I left that one out, because as written above the check can be
shortened.
> Of course, this is all starting to approach a very specific question that
> could actually be approached much more efficiently if it's your end goal
> (though I think I remember from your first email a different end goal):
That is true. But to learn some optimization techniques on a simple
example is much appreciated and will hopefully help me later on for the
iterated random walk cases.
> We can use the symmetry and "restart"ability of RW to do the following:
>
> x = cumsum(sample(c(1L, -1L), BIGNUMBER, replace=T))
> D = diff(which(x == 0))
Nice!
> This will give you a vector of how long x stays positive or negative at a
> time. Thinking through some simple translations lets you see that this set
> has the same distribution as how long a RW that starts negative stays
> negative.
I have to write those translations down. On first sight though we need
again to handle the case where it stays negative the whole time. `D`
then has length 0 and we have to count that for a walk longer than
`BIGNUMBER`.
> Again, this is only good for answering a very specific question
> about random walks and may not be useful if you have other more complicated
> questions in sight.
Just testing for 0 for the iterated cases will not be enough for
iterated random walks since an iterated random walk can go from negative
to nonnegative without being zero at this time/epoch.
I implemented all your suggestions and got the following.
 8<  code  >8 
f4 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
    R = apply(R, 1, cumsum) ## this applies cumsum `row-wise' to R and will make your life INFINITELY better
    fTemp <- function(x) {
        if (x[1] >= 0) {
            return(1)
        }
        for (i in 1:(length-1)) {
            if (x[i] < 0 && x[i + 1] >= 0) {
                return(as.integer(i/2 + 2))
            }
        }
    }
    countNegative = apply(R, 2, fTemp)
    tabulate(as.vector(countNegative), length)
}

f5 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    R = matrix(sample(c(1L, -1L), length*n, replace=T), nrow=n)
    R = apply(R, 1, cumsum)
    R[, R[1,]==1] = -1 * R[, R[1,]==1] # If the first element in the column is positive, flip the entire column
    R <- R >= 0
    countNegative = apply(R, 2, which.max)
    tabulate(as.vector(countNegative), length)
}

f6 <- function(n = 100000,      # number of simulations
               length = 100000) # length of iterated sum
{
    x = cumsum(sample(c(1L, -1L), length*n, replace=T))
    D = diff(which(c(0, x) == 0))
    tabulate(D, max(D))
}
 8<  code  >8 
The timings differ quite a lot, which is expected though.
> # f1 is using only for loops but only does half the calculations
> # and does not yet flip random walks starting with a positive value.
> set.seed(1) ; system.time( z1 <- f1(300, 1e5) )
   user  system elapsed
  2.700   0.008   2.729
> # f1 adapted with flips
> set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) )
   user  system elapsed
  4.457   0.004   4.475
> set.seed(1) ; system.time( z4 <- f4(300, 1e5) )
   user  system elapsed
  8.033   0.380   8.739
> set.seed(1) ; system.time( z5 <- f5(300, 1e5) )
   user  system elapsed
  9.640   0.812  10.588
> set.seed(1) ; system.time( z6 <- f6(300, 1e5) )
   user  system elapsed
  4.208   0.328   4.606
So `f6` seems to be the most efficient right now and is even
slightly faster than `f1` with the for loops. But we have to keep in
mind that both operate on different data sets although `set.seed(1)` is
used, and `f6` treats the problem totally differently.
One other thought is that when reusing the walks starting with a positive
term and flipping those, we can probably also take the backward/reverse
walk (dual problem). I will try that too.
Thank you very much,
Paul
[1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html
[2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html


I've only got a 20 minute layover, but three quick remarks:
1) Do a sanity check on your data size: if you want a million walks of a thousand steps, that already gets you to a billion integers to store; even at a very low bound of one byte each, that's already 1GB for the data, and you still have to process it all and run the OS. If you bump this to walks of length 10k, you are in big trouble.
Considered like that, it shouldn't surprise you that you are getting near memory limits.
If you really do need such a large simulation and are willing to make the time/space tradeoff, it may be worth doing simulations in smaller batches (say 50-100) and aggregating the needed stats for analysis. Also, consider direct use of the rm() function for memory management.
2) If you know that which.max()==1 can't happen for your data, might this trick be easier than forcing it through some tricky logic inside the which.max()?
X = which.max(...)
if (X[1] == 1) X = Inf # or whatever value
3) I don't have any texts at hand to confirm this, but isn't the expected value of the first hit time of a RW infinite? I think a hand-waving proof can be squeezed out of the optional stopping theorem with T = min(T_a, T_b) for a < 0 < b, letting a -> -Inf.
If I remember right, this suggests you are trying to calculate a CI for a distribution with no finite moments, a difficult task to say the least.
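One way to see the infinite expectation concretely: for the symmetric walk the survival function of the first return time T is P(T > 2n) = C(2n, n) / 4^n ~ 1/sqrt(pi*n), whose partial sums diverge. A small sketch computing it (our own helper, not from the thread's code):

```cpp
#include <cmath>

// Survival function of the first return time T of a simple random
// walk: P(T > 2n) = C(2n, n) / 4^n (a classical identity), built up
// via the ratio C(2k,k)/4^k = C(2k-2,k-1)/4^(k-1) * (2k-1)/(2k).
// Since P(T > 2n) ~ 1/sqrt(pi*n), the tail probabilities sum to
// infinity and E[T] is infinite.
double survival(int n) {
    double p = 1.0;                          // P(T > 0) = 1
    for (int k = 1; k <= n; ++k)
        p *= (2.0 * k - 1.0) / (2.0 * k);
    return p;
}
```

As a check, survival(1) is 0.5 and survival(2) is 0.375, matching C(2,1)/4 and C(4,2)/16.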
Hope these help and I'll write a more detailed reply to your notes below later,
Michael Weylandt
PS: what's an iterated RW? This is all outside my field (hence my spitball on #2 above)
PS2: sorry about the row/column mixup: I usually think of sample paths as rows...
On Aug 1, 2011, at 8:49 AM, Paul Menzel < [hidden email]> wrote:
> […]


On Monday, 01.08.2011, at 12:43 -0400, R. Michael Weylandt wrote:
> I've only got a 20 minute layover, but three quick remarks:
>
> 1) Do a sanity check on your data size: if you want a million walks of
> a thousand steps, that already gets you to a billion integers to
> store; even at a very low bound of one byte each, that's already 1GB
> for the data and you still have to process it all and run the OS. If
> you bump this to walks of length 10k, you are in big trouble.
>
> Considered like that, it shouldn't surprise you that you are getting
> near memory limits.
>
> If you really do need such a large simulation and are willing to make
> the time/space tradeoff, it may be worth doing simulations in smaller
> batches (say 50-100) and aggregating the needed stats for analysis.
I already did that, saved the result and ran it again. I also found [1]
and will look to do these things in parallel, since the simulations do
not depend on each other. I hope I can avoid the matrix then and use
just `replicate()`.
> Also, consider direct use of the rm() function for memory management.
I was looking for such a feature the last days and read to set the
variables to `NULL` to delete them somewhere. Now I found the correct
command. Thank you!
> 2) If you know that which.max()==1 can't happen for your data, might
> this trick be easier than forcing it through some tricky logic inside
> the which.max()
>
> X=which.max(...)
> if(X[1]==1) X=Inf # or whatever value
Noted for when I need that again.
> 3) I dont have any texts at hand to confirm this but isn't the
> expected value of the first hit time of a RW infinite?
That is indeed correct. The generating function of the first hitting
time of zero T₀ is (g_T₀)(s) ≔ 1/s (1 - √(1 - s)). Therefore
(g_T₀)ʹ(s) = -1/s² (1 - √(1 - s)) + 1/(2s √(1 - s)) → ∞ for s → 1.
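Differentiating the quotient term by term makes the divergence explicit:

```latex
g_{T_0}(s) = \frac{1 - \sqrt{1 - s}}{s}, \qquad
g_{T_0}'(s) = -\frac{1 - \sqrt{1 - s}}{s^{2}} + \frac{1}{2 s \sqrt{1 - s}}
\xrightarrow{\; s \to 1^{-} \;} \infty .
```

The first term stays bounded near s = 1; the second blows up like (1 - s)^(-1/2), so E[T₀] = g_{T₀}ʹ(1) is infinite.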
> I think a hand-waving proof can be squeezed out of the optional
> stopping theorem with T = min(T_a, T_b) for a < 0 < b, letting a -> -Inf.
Apparently there are several ways to prove that.
> If I remember right, this suggests you are trying to calculate a CI
> for a distribution with no finite moments, a difficult task to say the
> least.
I do not know. It scares me. ;) For the symmetric simple random walk
S_n ≔ ∑_{i=0}^n X_i I want to verify (1).
(1) n^(-½) ~ p_n ≔ P(max_{1 ≤ k ≤ n} S_k < 0)
a(x) ~ b(x) means that the quotient converges to 1 for x → ∞.
[…]
> PS  what's an iterated RW? This is all outside my field (hence my
> spitball on #2 above)
I am sorry, I meant *integrated* random walk [3][4]. Basically that is
the integral ("area", which can be negative):
A_n ≔ ∑_{i=0}^n S_i = ∑_{i=0}^n (n - i + 1) X_i
Being 0, S₀ and X₀ can always be omitted. So I basically just need to do
one `cumsum()` more over the walks.
> PS2  sorry about the row/column mixup: I usually think of sample
> paths as rows...
No problem at all. I already was confused that it looked differently
(transposed) after the first `apply()`. But it made sense.
Thanks,
Paul
[1] http://www.bioconductor.org/help/course-materials/2010/BioC2010/EfficientRProgramming.pdf
[2] http://www.steinsaltz.me.uk/probA/ProbALN13.pdf
[3] http://www-stat.stanford.edu/~amir/preprints/irw.ps
[4] http://arxiv.org/abs/0911.5456


Dear Dennis and Steve,
On Sunday, 31.07.2011, at 23:32 -0400, Steve Lianoglou wrote:
> […]
thank you very much for your suggestions.
That is indeed an impressive speedup.
1. I had first implemented this in FORTRAN (and in Python) too, but turned
to R for two reasons. First, I want to use other distributions later on
and thought that this would be easier with R, and that R would have them
implemented as fast as possible. Secondly, I thought that R would also
run faster given the right vectorization, e.g. using `cumsum()`. But I
guess it is difficult to find a good model that exploits the advantages
of R.
In particular, watching `top` while this example runs shows the CPU at
100 % but memory usage at only about 40 MB out of 2 GB. So with a
different data structure the calculations could perhaps be done on many
walks at once.
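To make that idea concrete, here is a small C++ sketch of one possible
layout (my own assumption, not the inline code from above): keep the
running sums of many walks in one flat vector and advance all unfinished
walks together, so the work is batched instead of one walk at a time.

```cpp
#include <cstdlib>
#include <vector>

// Sketch: batch many +/-1 walks in one vector. A walk is "open" while
// its running sum is negative; once the sum returns to >= 0 its length
// is frozen. rand() is kept only for parity with the inline code above;
// max_len caps the number of steps so no walk can run forever.
std::vector<long> first_return_lengths(int n_walks, long max_len, unsigned seed) {
    std::srand(seed);
    std::vector<long> csum(n_walks), len(n_walks, 1);
    std::vector<long> done;
    for (int i = 0; i < n_walks; ++i)
        csum[i] = (std::rand() % 2 == 0) ? 1 : -1;   // first step of every walk
    for (long step = 1; step < max_len; ++step) {    // advance all open walks together
        bool any_open = false;
        for (int i = 0; i < n_walks; ++i) {
            if (csum[i] >= 0) continue;              // this walk already returned
            csum[i] += (std::rand() % 2 == 0) ? 1 : -1;
            ++len[i];
            any_open = true;
        }
        if (!any_open) break;                        // everything finished early
    }
    for (int i = 0; i < n_walks; ++i)
        if (csum[i] >= 0) done.push_back(len[i]);    // lengths of returned walks
    return done;
}
```

Whether this batching is actually faster than running the walks one by
one depends on how quickly the batch thins out; it mainly illustrates
the data layout, not a tuned implementation.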
2. It is indeed possible that the walk never returns to zero, so I
should make sure that I abort the while loop after a certain length.
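As a sketch of that abort (again plain C++, with a hypothetical
`max_len` parameter and a return value of 0 to flag a censored walk;
this is not the inline code above):

```cpp
#include <cstdlib>

// Single walk as in f4/fxx, but the while loop gives up after max_len
// steps; 0 signals "walk had not returned yet" (a censored observation).
long capped_first_return(long max_len) {
    long len = 1;
    long csum = (std::rand() % 2 == 0) ? 1 : -1;  // first +/-1 step
    while (csum < 0) {
        if (len >= max_len) return 0;             // abort: censored
        csum += (std::rand() % 2 == 0) ? 1 : -1;
        ++len;
    }
    return len;
}
```

The censored runs then have to be handled explicitly in the analysis,
e.g. dropped or counted separately, since treating them as ordinary
lengths would bias the tail of the distribution.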
3. Looking at the data types, I am wondering whether some integer
overflow(?) could happen. I could make the length variable unsigned, I
suppose [1]. But `csum` could still go from -`len` up to 0, and for the
simple random walk unsigned should not be a problem either, besides
that the logic/checks have to be adapted.
For integrated random walks, `ccsum += csum`, `ccsum` could go from
about -(csum**2)/2 up to 0. So later on I should probably use the 64-bit
(unsigned) `long` data type for `ccsum`, `csum` and `length` to avoid
those problems. Memory does not seem to be a problem. I also need to add
an additional check for the height and the length in the while loop,
like the following.
(csum < 0) && (csum > -ULONG_MAX) && (len <= ULONG_MAX)
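A small self-contained version of those guards (my own sketch, assuming
a signed 64-bit `csum` and an unsigned 64-bit `len`, and testing
*before* the next step so nothing can wrap):

```cpp
#include <climits>

// True when it is still safe to take another -1 step and another len++.
// Checking before the update avoids the undefined behaviour of signed
// overflow and the silent wrap-around of unsigned arithmetic.
bool may_step(long csum, unsigned long len) {
    return csum > LONG_MIN + 1    // csum - 1 cannot underflow
        && len < ULONG_MAX;       // len + 1 cannot wrap to 0
}
```

Checking after the fact, as the condition above does, only works as long
as the variables land exactly on the limit values instead of wrapping
past them, which is why pre-checking is the safer pattern.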
So I came up with the following, and to keep everything unsigned I
consider how long the random walk stays positive instead of negative.
 8<  code  >8 
library(inline)
inc <- "
#include <climits>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
"
f9 <- cxxfunction(includes=inc, plugin="Rcpp", body="
unsigned long len = 1;
if ((rand() % 2) == 0) {
  return wrap(len);
}
unsigned long x = 1;
for (unsigned long csum = x; csum > 0; csum = ((rand() % 2) == 0) ? csum + 1 : csum - 1) {
  len++;
  if ((csum == ULONG_MAX) || (len == ULONG_MAX)) { /* abort before overflow */
    return wrap(len);
  }
}
return wrap(len);
")
 8<  code  >8 
I do not know if the compiler would have optimized it that way anyway
and if there is any difference (besides the overflow checks).
> set.seed(1); system.time( z9_1 <- replicate(1000, f9()) )
   user  system elapsed
  0.076   0.004   0.084
> range(z9_1)
[1] 1 1449034
> length(z9_1)
[1] 1000
Thanks,
Paul
[1] https://secure.wikimedia.org/wikipedia/en/wiki/Integer_(computer_science)#Common_integral_data_types

