Hello Everyone,
I just started a new job & it requires heavy use of R to analyze datasets. I have a data.table that looks like this. It is sorted by ID & Date, there are about 150 different IDs & the dataset spans 3 million rows. The main columns of concern are ID, date, and totret. What I need to do is to derive daily returns for each ID from totret, which is simply totret at time t+1 divided by totret at time t. X id ticker date_ adjClose totret RankStk 427225 427225 00174410 AHS 2001-11-13 21.66 100.00000 1235 441910 441910 00174410 AHS 2001-11-14 21.60 99.72300 1235 458458 458458 00174410 AHS 2001-11-15 21.65 99.95380 1235 284003 284003 00174410 AHS 2001-11-16 21.59 99.67680 1235 Two problems for me: 1)I can't just apply it to the entire column since there will be problems at the boundary points where the ID changes from 1 to another. I need to find out how to specify a restriction on the name of the ID 2) From Java, instinctively I would use a loop to calculate daily returns, but I found out that R is very slow with loops, so I need to find an efficient way to calculate daily returns with such a huge dataset. Thanks a lot! |
Hello Jeff,
Try this: test <- data.frame(totret=rnorm(10^7)) #create some sample data test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"] The general idea is to take the column "totret" excluding the first 1, dividided by "totret" exluding the last row. This gives in effect t+1 (since t is now shorter)/t I assigned the result to a new column "dailyreturn". For 10^7 rows, it tooks 1.92 seconds on my system. HTH, Josh On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <[hidden email]> wrote: > > Hello Everyone, > > I just started a new job & it requires heavy use of R to analyze datasets. > > I have a data.table that looks like this. It is sorted by ID & Date, there > are about 150 different IDs & the dataset spans 3 million rows. The main > columns of concern are ID, date, and totret. What I need to do is to derive > daily returns for each ID from totret, which is simply totret at time t+1 > divided by totret at time t. > > X id ticker date_ adjClose totret RankStk > 427225 427225 00174410 AHS 2001-11-13 21.66 100.00000 1235 > 441910 441910 00174410 AHS 2001-11-14 21.60 99.72300 1235 > 458458 458458 00174410 AHS 2001-11-15 21.65 99.95380 1235 > 284003 284003 00174410 AHS 2001-11-16 21.59 99.67680 1235 > > Two problems for me: > > 1)I can't just apply it to the entire column since there will be problems at > the boundary points where the ID changes from 1 to another. I need to find > out how to specify a restriction on the name of the ID > > 2) From Java, instinctively I would use a loop to calculate daily returns, > but I found out that R is very slow with loops, so I need to find an > efficient way to calculate daily returns with such a huge dataset. > > Thanks a lot! > > > -- > View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242633.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Joshua Wiley Senior in Psychology University of California, Riverside http://www.joshuawiley.com/ ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
Hey Josh, Thanks for the quick response! I guess I have to switch from the Java mindset to the matrix/vector mindset of R. Your code worked very well, but I just have one problem: Essentially I have a time series of stock A, followed by a time series of stock B, etc. So there are break points in the data (the points where it switches stocks have incorrect returns, and should be NA at t=0 for each stock) Is there an easy way to account for this in R? -- View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242697.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
In reply to this post by Joshua Wiley-2
Hey Josh,
Thanks for the quick response! I guess I have to switch from the Java mindset to the matrix/vector mindset of R. Your code worked very well, but I just have one problem: Essentially I have a time series of stock A, followed by a time series of stock B, etc. So there are break points in the data (the points where it switches stocks have incorrect returns, and should be NA at t=0 for each stock) Is there an easy way to account for this in R? What I was thinking of is if there is a way to make a filter rule. Such as if the ID of the row matches Stock A, then perform this. >>"Hello Jeff, Try this: test <- data.frame(totret=rnorm(10^7)) #create some sample data test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"] The general idea is to take the column "totret" excluding the first 1, dividided by "totret" exluding the last row. This gives in effect t+1 (since t is now shorter)/t I assigned the result to a new column "dailyreturn". For 10^7 rows, it tooks 1.92 seconds on my system." |
Hey Jeff,
I have a few ideas. Each has some different requirements, and to help you choose, I bench marked them. ###START### ##Basic data > test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3), time=rep(c(1, rep(0, 999)), 10^4)) ##Option 1: probably the most general, but also the slowest by far. ##The idea is it does the calculation for each stock/ID, and then concatenates [c()] an NA in front. > system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"], test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE) user system elapsed 49.11 0.42 49.86 ##Option 2: Assumes that you have the same number of measurements for each stock/ID so you can just assign an NA every nth row. ##This is fairly fast > system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE) user system elapsed 1.11 0.21 1.31 > system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA, gcFirst=TRUE) user system elapsed 0.39 0.04 0.42 ##Option 3: Assumes that you have some variable (time in my little test data) that somehow indicates when each stock/ID has its first measurement. In the example, the first measurement gets a 1 and subsequent ones a 0. So we just assign NA in 'dailyreturns' everytime the other "time" column has a 1. Again, a big assumption, but fairly quick. > system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE) user system elapsed 1.06 0.17 1.25 > system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA, gcFirst=TRUE) user system elapsed 0.46 0.09 0.55 ###END### I really feel like there should be a faster way that is also more general, but it is late and I am not coming up with any better ideas at the moment. Perhaps somehow finding the first instance of a stock/ID? Anyway, this was simulated on 10 million rows, so maybe by() works plenty fast for you. Josh On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <[hidden email]> wrote: > > Hey Josh, > > Thanks for the quick response! > > I guess I have to switch from the Java mindset to the matrix/vector mindset > of R. > > Your code worked very well, but I just have one problem: > > Essentially I have a time series of stock A, followed by a time series of > stock B, etc. > So there are break points in the data (the points where it switches stocks > have incorrect returns, and should be NA at t=0 for each stock) > > Is there an easy way to account for this in R? What I was thinking of is if > there is a way to make a filter rule. Such as if the ID of the row matches > Stock A, then perform this. > >>>"Hello Jeff, > > Try this: > > test <- data.frame(totret=rnorm(10^7)) #create some sample data > test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"] > > The general idea is to take the column "totret" excluding the first 1, > dividided by "totret" exluding the last row. This gives in effect t+1 > (since t is now shorter)/t > > I assigned the result to a new column "dailyreturn". For 10^7 rows, > it tooks 1.92 seconds on my system." > -- > View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242703.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Joshua Wiley Senior in Psychology University of California, Riverside http://www.joshuawiley.com/ ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
I am not exactly sure how your filtering code is working, but take a look at
?na.omit You will probably need a few additional steps if you want to remove all rows related to a particular id. Also look at ?subset which is a good general way to subset your data. Josh On Thu, Jun 3, 2010 at 11:45 PM, Jeffery Ding <[hidden email]> wrote: > Thanks, you have been tremendously helpful! > I will be able to implement option 2, after I filter out stocks with > incomplete data sets. > > So far, for my filtering code I have: > > ##Filtering > > x<-length(unique(Returns$date_)) > y<-unique(Returns$id) > Returns.filter<-Returns > > i<-1 > > while(i<=length(y)) { > a<-sum(Returns$id==y[i]) > if(a<x) { > ##need code that will remove all rows with id a > } > i<-i+1 > } > > > > On Fri, Jun 4, 2010 at 2:40 PM, Joshua Wiley <[hidden email]> wrote: >> >> Hey Jeff, >> >> I have a few ideas. Each has some different requirements, and to help >> you choose, I bench marked them. >> >> >> ###START### >> >> ##Basic data >> > test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3), >> > time=rep(c(1, rep(0, 999)), 10^4)) >> >> ##Option 1: probably the most general, but also the slowest by far. >> ##The idea is it does the calculation for each stock/ID, and then >> concatenates [c()] an NA in front. >> >> > system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"], >> > test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE) >> user system elapsed >> 49.11 0.42 49.86 >> >> ##Option 2: Assumes that you have the same number of measurements for >> each stock/ID so you can just assign an NA every nth row. >> ##This is fairly fast >> >> > system.time(test[-1,"dailyreturns"] <- >> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE) >> user system elapsed >> 1.11 0.21 1.31 >> > system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA, >> > gcFirst=TRUE) >> user system elapsed >> 0.39 0.04 0.42 >> >> ##Option 3: Assumes that you have some variable (time in my little >> test data) that somehow indicates when each stock/ID has its first >> measurement. In the example, the first measurement gets a 1 and >> subsequent ones a 0. So we just assign NA in 'dailyreturns' everytime >> the other "time" column has a 1. Again, a big assumption, but fairly >> quick. >> >> > system.time(test[-1,"dailyreturns"] <- >> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE) >> user system elapsed >> 1.06 0.17 1.25 >> > system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA, >> > gcFirst=TRUE) >> user system elapsed >> 0.46 0.09 0.55 >> >> ###END### >> >> I really feel like there should be a faster way that is also more >> general, but it is late and I am not coming up with any better ideas >> at the moment. Perhaps somehow finding the first instance of a >> stock/ID? Anyway, this was simulated on 10 million rows, so maybe >> by() works plenty fast for you. >> >> Josh >> >> >> On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <[hidden email]> wrote: >> > >> > Hey Josh, >> > >> > Thanks for the quick response! >> > >> > I guess I have to switch from the Java mindset to the matrix/vector >> > mindset >> > of R. >> > >> > Your code worked very well, but I just have one problem: >> > >> > Essentially I have a time series of stock A, followed by a time series >> > of >> > stock B, etc. >> > So there are break points in the data (the points where it switches >> > stocks >> > have incorrect returns, and should be NA at t=0 for each stock) >> > >> > Is there an easy way to account for this in R? What I was thinking of is >> > if >> > there is a way to make a filter rule. Such as if the ID of the row >> > matches >> > Stock A, then perform this. >> > >> >>>"Hello Jeff, >> > >> > Try this: >> > >> > test <- data.frame(totret=rnorm(10^7)) #create some sample data >> > test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"] >> > >> > The general idea is to take the column "totret" excluding the first 1, >> > dividided by "totret" exluding the last row. This gives in effect t+1 >> > (since t is now shorter)/t >> > >> > I assigned the result to a new column "dailyreturn". For 10^7 rows, >> > it tooks 1.92 seconds on my system." >> > -- >> > View this message in context: >> > http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242703.html >> > Sent from the R help mailing list archive at Nabble.com. >> > >> > ______________________________________________ >> > [hidden email] mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide >> > http://www.R-project.org/posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code. >> > >> >> >> >> -- >> Joshua Wiley >> Senior in Psychology >> University of California, Riverside >> http://www.joshuawiley.com/ > > > > -- > Jeffery Ding > Duke University, Class of 2012 > (224) 622-3398 | [hidden email] > -- Joshua Wiley Senior in Psychology University of California, Riverside http://www.joshuawiley.com/ ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
In reply to this post by Jeff08
Hi:
The key phrase in your mail was 'data.table'. Given the size of the object, it is very likely to be a data.table, which (oddly enough) comes from package data.table. It is designed to quickly process information in very large datasets. 3M rows is an 'average' sized data.table :) Your request isn't very sophisticated - it appears this function should work groupwise (group = ID): dret <- function(x) c(100.00, 100 * x[-1]/x[-length(x)]) This function can be processed in data.table or ddply (package plyr) groupwise without much difficulty. I'm going to assume that the data are ordered in time for simplicity. I'm also using 100 for the first entry in the function - if you want, you can change the initial 100.00 to NA. Let's generate a little fake data: id <- as.character(rep(c(427225, 290157, 394025, 382940), each = 1000)) times <- rep(seq(as.Date('2001-11-13'), by = 'days', length = 1000), 4) totret <- c(rnorm(1000, 20, 0.1), rnorm(1000, 25, 0.1), rnorm(1000, 30, 0.1), rnorm(1000, 35, 0.1)) # data frame: DF <- data.frame(id = id, times = times, totret = totret) # data table: library(data.table) DR <- data.table(DF) # data.table sets up id as the table's primary key - note that the storage mode # of the key has to be integer. tables() # see what we've got # set id as the table key, do the calculation by group and tack the result onto DR system.time({ setkey(DR, id); DR2 <- DR[, dret(totret), by = id]; DR$return <- DR2$V1 }) user system elapsed 0 0 0 library(plyr) system.time(df2 <- ddply(DF, .(id), transform, return = dret(totret))) user system elapsed 0.03 0.00 0.05 The difference between the two is this. The data.table calculation returns a data.table DR2 with the key and the returns, after which we add the column of returns to the original data table DR. In contrast, the ddply calculation tacks on the column of returns to the original data frame as a result of transform. Notice that in the data.table code, we set the table key (which is often the most time consuming task, since it orders the data by the values in its key), did the calculation and tacked the result onto the original table almost instantaneously. According to the data.table package author, the time savings in using data.table scales upward as the size of the table increases - in other words, the bigger the table, the faster data.table will be relative to other processing methods currently available in R. You can see that there is a noticeable time difference at n = 4000, so the difference at n = 3M will be more dramatic. Development work in plyr is showing that the gap between it and data.table is narrowing, but both packages are in active development, so R users can look forward to two very powerful packages for summarizing, transforming and condensing data. I would suggest that you read the vignette and FAQ from data.table (available from the on-line data.table help page) and the documentation of plyr at its author's web site: http://had.co.nz/plyr/ There is a tutorial with slides and a full-scale document. HTH, Dennis On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <[hidden email]> wrote: > > Hello Everyone, > > I just started a new job & it requires heavy use of R to analyze datasets. > > I have a data.table that looks like this. It is sorted by ID & Date, there > are about 150 different IDs & the dataset spans 3 million rows. The main > columns of concern are ID, date, and totret. What I need to do is to derive > daily returns for each ID from totret, which is simply totret at time t+1 > divided by totret at time t. > > X id ticker date_ adjClose totret RankStk > 427225 427225 00174410 AHS 2001-11-13 21.66 100.00000 1235 > 441910 441910 00174410 AHS 2001-11-14 21.60 99.72300 1235 > 458458 458458 00174410 AHS 2001-11-15 21.65 99.95380 1235 > 284003 284003 00174410 AHS 2001-11-16 21.59 99.67680 1235 > > Two problems for me: > > 1)I can't just apply it to the entire column since there will be problems > at > the boundary points where the ID changes from 1 to another. I need to find > out how to specify a restriction on the name of the ID > > 2) From Java, instinctively I would use a loop to calculate daily returns, > but I found out that R is very slow with loops, so I need to find an > efficient way to calculate daily returns with such a huge dataset. > > Thanks a lot! > > > -- > View this message in context: > http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242633.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
Free forum by Nabble | Edit this page |