

Hello Everyone,
I just started a new job & it requires heavy use of R to analyze datasets.
I have a data.table that looks like this. It is sorted by ID & Date, there are about 150 different IDs & the dataset spans 3 million rows. The main columns of concern are ID, date, and totret. What I need to do is to derive daily returns for each ID from totret, which is simply totret at time t+1 divided by totret at time t.
X id ticker date_ adjClose totret RankStk
427225 427225 00174410 AHS 20011113 21.66 100.00000 1235
441910 441910 00174410 AHS 20011114 21.60 99.72300 1235
458458 458458 00174410 AHS 20011115 21.65 99.95380 1235
284003 284003 00174410 AHS 20011116 21.59 99.67680 1235
Two problems for me:
1) I can't just apply it to the entire column, since there will be problems at the boundary points where the ID changes from one stock to another. I need to find out how to restrict the calculation by ID.
2) Coming from Java, I would instinctively use a loop to calculate daily returns, but I found out that R is very slow with loops, so I need an efficient way to compute daily returns on such a huge dataset.
Thanks a lot!


Hello Jeff,
Try this:
test <- data.frame(totret=rnorm(10^7)) # create some sample data
test[-1, "dailyreturn"] <- test[-1, "totret"] / test[-nrow(test), "totret"]
The general idea is to take the column "totret" excluding the first row,
divided by "totret" excluding the last row. This gives, in effect, t+1
divided by t (since the t series is now one element shorter).
I assigned the result to a new column "dailyreturn". For 10^7 rows,
it took 1.92 seconds on my system.
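A quick sanity check of the shift trick on a toy vector (the numbers are made up for illustration, not taken from your data):

```r
# Toy illustration of the shift trick: dropping the first element gives
# the t+1 series, dropping the last element gives the t series.
totret <- c(100.00, 99.723, 99.9538, 99.6768)
daily <- totret[-1] / totret[-length(totret)]
daily[1] # 99.723/100 = 0.99723
```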
HTH,
Josh
On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 < [hidden email]> wrote:

Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Hey Josh,
Thanks for the quick response!
I guess I have to switch from the Java mindset to the matrix/vector mindset of R.
Your code worked very well, but I just have one problem:
Essentially I have a time series of stock A, followed by a time series of stock B, etc.
So there are break points in the data (the rows where it switches stocks
have incorrect returns; the return should be NA at t=0 for each stock).
Is there an easy way to account for this in R? What I was thinking of is
whether there is a way to make a filter rule, such as: if the ID of the
row matches stock A, then perform this.


Hey Jeff,
I have a few ideas. Each has somewhat different requirements, and to help
you choose, I benchmarked them.
###START###
## Basic data
> test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3), time=rep(c(1, rep(0, 999)), 10^4))
## Option 1: probably the most general, but also the slowest by far.
## The idea is that it does the calculation for each stock/ID, and then
## concatenates [c()] an NA in front.
> system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"], test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE)
   user  system elapsed
  49.11    0.42   49.86
## Option 2: Assumes that you have the same number of measurements for
## each stock/ID, so you can just assign an NA every nth row.
## This is fairly fast.
> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.11    0.21    1.31
> system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.39    0.04    0.42
## Option 3: Assumes that you have some variable ('time' in my little
## test data) that indicates when each stock/ID has its first
## measurement. In the example, the first measurement gets a 1 and
## subsequent ones a 0. So we just assign NA in 'dailyreturns' every time
## the "time" column has a 1. Again, a big assumption, but fairly quick.
> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.06    0.17    1.25
> system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.46    0.09    0.55
###END###
I really feel like there should be a faster way that is also more
general, but it is late and I am not coming up with any better ideas
at the moment. Perhaps somehow finding the first instance of a
stock/ID? Anyway, this was simulated on 10 million rows, so maybe
by() works plenty fast for you.
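One hedged sketch of that "first instance of a stock/ID" idea, using !duplicated() to mark the first row of each ID (this assumes the rows are sorted by ID, and uses small made-up data):

```r
# Sketch: compute the lagged return over the whole column at once, then
# blank out the first row of each id with !duplicated().
# Assumes rows are sorted by id; small made-up data for illustration.
test <- data.frame(totret = rnorm(20), id = rep(1:4, each = 5))
test$dailyreturns <- NA
test$dailyreturns[-1] <- test$totret[-1] / test$totret[-nrow(test)]
test$dailyreturns[!duplicated(test$id)] <- NA # first row per id -> NA
```

This avoids by() entirely, so it should scale to millions of rows, though it has not been benchmarked here.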
Josh
On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 < [hidden email]> wrote:
>
> Hey Josh,
>
> Thanks for the quick response!
>
> I guess I have to switch from the Java mindset to the matrix/vector mindset
> of R.
>
> Your code worked very well, but I just have one problem:
>
> Essentially I have a time series of stock A, followed by a time series of
> stock B, etc.
> So there are break points in the data (the points where it switches stocks
> have incorrect returns, and should be NA at t=0 for each stock)
>
> Is there an easy way to account for this in R? What I was thinking of is if
> there is a way to make a filter rule. Such as if the ID of the row matches
> Stock A, then perform this.
>
>>>"Hello Jeff,
>
> Try this:
>
> test < data.frame(totret=rnorm(10^7)) #create some sample data
> test[1,"dailyreturn"] < test[1,"totret"]/test[nrow(test),"totret"]
>
> The general idea is to take the column "totret" excluding the first 1,
> dividided by "totret" exluding the last row. This gives in effect t+1
> (since t is now shorter)/t
>
> I assigned the result to a new column "dailyreturn". For 10^7 rows,
> it tooks 1.92 seconds on my system."
> 
> View this message in context: http://r.789695.n4.nabble.com/RNewbiepleasehelptp2242633p2242703.html> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
>

Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


I am not exactly sure how your filtering code is working, but take a look at
?na.omit
You will probably need a few additional steps if you want to remove
all rows related to a particular id. Also look at ?subset which is a
good general way to subset your data.
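A vectorized version of such a filter might look like the following -- an untested sketch, assuming the columns are named id and date_ as in your sample, and that "complete" means one row per distinct date (the Returns data here is made up):

```r
# Sketch: keep only ids that have an observation for every date.
# Assumes columns 'id' and 'date_'; Returns here is small made-up data.
Returns <- data.frame(id = rep(c("a", "b"), c(4, 2)),
                      date_ = c(1:4, 1:2))
n.dates <- length(unique(Returns$date_)) # number of distinct dates
counts <- table(Returns$id)              # rows per id
keep <- names(counts)[counts == n.dates] # ids with complete data
Returns.filter <- subset(Returns, id %in% keep)
```

This replaces the while loop with table() and %in%, which should be much faster on 3 million rows.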
Josh
On Thu, Jun 3, 2010 at 11:45 PM, Jeffery Ding < [hidden email]> wrote:
> Thanks, you have been tremendously helpful!
> I will be able to implement option 2, after I filter out stocks with
> incomplete data sets.
>
> So far, for my filtering code I have:
>
> ## Filtering
>
> x <- length(unique(Returns$date_))
> y <- unique(Returns$id)
> Returns.filter <- Returns
>
> i <- 1
>
> while(i <= length(y)) {
>   a <- sum(Returns$id == y[i])
>   if(a < x) {
>     ## need code that will remove all rows with id y[i]
>   }
>   i <- i + 1
> }
>
>
>
> 
> Jeffery Ding
> Duke University, Class of 2012
> (224) 622-3398  [hidden email]
>

Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/


Hi:
The key phrase in your mail was 'data.table'. Given the size of the object,
it is very likely to be a data.table, which (oddly enough) comes from the
package data.table. It is designed to quickly process information in very
large datasets; 3M rows is an 'average' sized data.table :)
Your request isn't very sophisticated -- it appears this function should
work groupwise (group = ID):
dret <- function(x) c(100.00, 100 * x[-1]/x[-length(x)])
This function can be processed groupwise in data.table or ddply (package
plyr) without much difficulty. I'm going to assume that the data are
ordered in time for simplicity. I'm also using 100 for the first entry in
the function -- if you want, you can change the initial 100.00 to NA.
Let's generate a little fake data:
id <- as.character(rep(c(427225, 290157, 394025, 382940), each = 1000))
times <- rep(seq(as.Date('2001-11-13'), by = 'days', length = 1000), 4)
totret <- c(rnorm(1000, 20, 0.1), rnorm(1000, 25, 0.1), rnorm(1000, 30, 0.1),
            rnorm(1000, 35, 0.1))
# data frame:
DF <- data.frame(id = id, times = times, totret = totret)
# data table:
library(data.table)
DR <- data.table(DF)
# data.table sets up id as the table's primary key -- note that the storage
# mode of the key has to be integer.
tables() # see what we've got
# set id as the table key, do the calculation by group and tack the result
# onto DR
system.time({ setkey(DR, id); DR2 <- DR[, dret(totret), by = id];
              DR$return <- DR2$V1 })
   user  system elapsed
      0       0       0
library(plyr)
system.time(df2 <- ddply(DF, .(id), transform, return = dret(totret)))
   user  system elapsed
   0.03    0.00    0.05
The difference between the two is this: the data.table calculation returns
a data.table DR2 with the key and the returns, after which we add the
column of returns to the original data table DR. In contrast, the ddply
calculation tacks the column of returns onto the original data frame as a
result of transform.
Notice that in the data.table code, we set the table key (which is often
the most time-consuming task, since it orders the data by the values in
its key), did the calculation and tacked the result onto the original
table almost instantaneously.
According to the data.table package author, the time savings from using
data.table scale upward as the size of the table increases -- in other
words, the bigger the table, the faster data.table will be relative to
other processing methods currently available in R. You can see that there
is a noticeable time difference at n = 4000, so the difference at n = 3M
will be more dramatic. Development work in plyr is narrowing the gap
between it and data.table, but both packages are in active development, so
R users can look forward to two very powerful packages for summarizing,
transforming and condensing data.
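Depending on your data.table version, the `:=` operator may also be available; it adds the grouped result as a column by reference, avoiding the DR2/V1 step. A sketch, assuming your version of data.table supports by-group `:=` assignment (the `ret` column name is just an illustration):

```r
library(data.table)
# Sketch (assumes a data.table version where := works with by-group
# assignment): add the per-id return column by reference in one step.
DR <- data.table(id = rep(c("a", "b"), each = 5), totret = rnorm(10))
setkey(DR, id)
DR[, ret := c(100.00, 100 * totret[-1]/totret[-length(totret)]), by = id]
```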
I would suggest that you read the vignette and FAQ from data.table
(available from the online data.table help page) and the documentation of
plyr at its author's web site: http://had.co.nz/plyr/
There is a tutorial with slides and a full-scale document.
HTH,
Dennis
On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 < [hidden email]> wrote:

