For loop gets exponentially slower as dataset gets larger...


R user-3
I am running R 2.1.1 in a Microsoft Windows XP environment.
   
  I have a matrix with three vectors (“columns”) and ~2 million “rows”.  The three vectors are date_, id, and price.  The data is ordered (sorted) by code and date_.
   
  (The matrix contains daily prices for several thousand stocks, and has ~2 million “rows”. If a stock did not trade on a particular date, its price is set to “NA”)
   
  I wish to add a fourth vector that is “next_price”. (“Next price” is the current price as long as the current price is not “NA”.  If the current price is NA, the “next_price” is the next price that the security with this same ID trades.  If the stock does not trade again,  “next_price” is set to NA.)
   
  I wrote the following loop to calculate next_price.  It works as intended, but I have one problem.  When I have only 10,000 rows of data, the calculations are very fast.  However, when I run the loop on the full 2 million rows, it seems to take ~ 1 second per row.
   
  Why is this happening?  What can I do to speed the calculations when running the loop on the full 2 million rows?
   
  (I am not running low on memory, but I am maxing out my CPU at 100%)
   
  Here is my code and some sample data:
   
  data <- data[order(data$code, data$date_), ]
  l <- dim(data)[1]
  w <- 3
  data[l, w+1] <- NA
   
  for (i in (l-1):1) {
    data[i, w+1] <- ifelse(!is.na(data[i, w]), data[i, w],
                           ifelse(data[i, 2] == data[i+1, 2],
                                  data[i+1, w+1], NA))
  }
   
   
  date         id      price      next_price
  6/24/2005    1635    444.7838   444.7838
  6/27/2005    1635    448.4756   448.4756
  6/28/2005    1635    455.4161   455.4161
  6/29/2005    1635    454.6658   454.6658
  6/30/2005    1635    453.9155   453.9155
  7/1/2005     1635    453.3153   453.3153
  7/4/2005     1635    NA         453.9155
  7/5/2005     1635    453.9155   453.9155
  7/6/2005     1635    453.0152   453.0152
  7/7/2005     1635    452.8651   452.8651
  7/8/2005     1635    456.0163   456.0163
  12/19/2005   1635    442.6982   442.6982
  12/20/2005   1635    446.5159   446.5159
  12/21/2005   1635    452.4714   452.4714
  12/22/2005   1635    451.074    451.074
  12/23/2005   1635    454.6453   454.6453
  12/27/2005   1635    NA         NA
  12/28/2005   1635    NA         NA
  12/1/2003    1881    66.1562    66.1562
  12/2/2003    1881    64.9192    64.9192
  12/3/2003    1881    66.0078    66.0078
  12/4/2003    1881    65.8098    65.8098
  12/5/2003    1881    64.1275    64.1275
  12/8/2003    1881    64.8697    64.8697
  12/9/2003    1881    63.5337    63.5337
  12/10/2003   1881    62.9399    62.9399

               


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: For loop gets exponentially slower as dataset gets larger...

bogdan romocea
Your 2-million-iteration loop is overkill, because in the (vast)
majority of cases you don't need to loop at all. You could try
something like this:
1. Split the prices by id, e.g.
price.list <- split(price, id)
Then, for each id:
2a. When price is not NA, assign it to next.price _without_ using a
for loop, e.g.
next.price[!is.na(price)] <- price[!is.na(price)]
2b. Use a for loop only where price is NA, but even then work with
vectors as much as you can, for example (untested):
for (i in setdiff(which(is.na(price)), length(price))) {
        remaining.prices <- price[(i+1):length(price)]
        of.interest <- head(remaining.prices[!is.na(remaining.prices)], 1)
        if (length(of.interest) == 0) next.price[i] <- NA else
                next.price[i] <- of.interest
}
To run (2a) and (2b) over every id you could use lapply(); to paste the
bits back together, try do.call("rbind", price.list). You might also want
to take a look at ?Rprof and check the archives for efficiency suggestions.
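Assembling steps 1, 2a and 2b above into one runnable sketch (the function name and the toy data frame are illustrative, not from the thread; unsplit() is used to reassemble the pieces in place of the do.call("rbind", ...) suggestion, purely to keep the sketch short):

```r
# Compute next_price for one id's price vector (steps 2a and 2b).
next.price.for.id <- function(price) {
  next.price <- rep(NA_real_, length(price))
  next.price[!is.na(price)] <- price[!is.na(price)]          # 2a: no loop
  for (i in setdiff(which(is.na(price)), length(price))) {   # 2b: NAs only
    remaining <- price[(i + 1):length(price)]
    of.interest <- head(remaining[!is.na(remaining)], 1)
    if (length(of.interest) == 1) next.price[i] <- of.interest
  }
  next.price
}

# Toy data in the same shape as the sample: two ids, one embedded NA,
# one trailing NA that never trades again.
d <- data.frame(id    = c(1635, 1635, 1635, 1881, 1881),
                price = c(453.3153, NA, 453.9155, 66.1562, NA))

# Step 1: split by id, apply per-id function, reassemble in row order.
d$next_price <- unsplit(lapply(split(d$price, d$id),
                               next.price.for.id), d$id)
```

Because the per-row loop now runs only over the NA positions of one id's vector at a time, the quadratic cost of repeatedly indexing into a 2-million-row data frame disappears.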



Re: For loop gets exponentially slower as dataset gets larger...

Gabor Grothendieck
In reply to this post by R user-3
Accepting this stacked representation for the
moment, try this.  When reordering the dates, do it
in reverse order.  Then loop over all codes,
applying the zoo function na.locf to the
prices for that code; "locf" stands for last
observation carried forward.  Since our dates
are reversed, it will bring the next observation
backwards.  Finally, sort back into ascending order.

library(zoo)     # needed for na.locf, which also works on non-zoo objects
data <- data[order(data$code, -as.numeric(data$date_)), ]
attach(data)
next_price <- price
for (i in unique(code)) next_price[code == i] <- na.locf(price[code == i], na.rm = FALSE)
data$next_price <- next_price
data <- data[order(data$code, data$date_), ]
detach()

Here it is again, but this time we represent the data
as a list of zoo objects with one component per code.
In the code below we split the data on code and
apply f to each piece.  Note that na.locf replaces
NAs with the last observation carried forward, so
by reversing the data, applying na.locf, and reversing
again we get the next observation carried backward.

library(zoo)
f <- function(x) {
        z <- zoo(x$price, x$date_)
        next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
        merge(z, next_price)
}
z <- lapply(split(data, data$code), f)
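As a quick sanity check of the reverse-then-na.locf idea on a plain vector (toy numbers, not from the thread; requires the zoo package):

```r
library(zoo)                      # provides na.locf

p <- c(10, NA, 12, NA)            # toy prices; the last NA never trades again
rev(na.locf(rev(p), na.rm = FALSE))
# -> 10 12 12 NA : the inner NA is filled from the *next* price, while the
#    trailing NA stays NA, which is exactly the next_price rule
```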


On 1/3/06, r user <[hidden email]> wrote:

