R Newbie, please help!

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

R Newbie, please help!

Jeff08
Hello Everyone,

I just started a new job & it requires heavy use of R to analyze datasets.

I have a data.table that looks like this. It is sorted by ID & Date, there are about 150 different IDs & the dataset spans 3 million rows. The main columns of concern are ID, date, and totret. What I need to do is to derive daily returns for each ID from totret, which is simply totret at time t+1 divided by totret at time t.

              X       id ticker      date_ adjClose    totret RankStk
427225   427225 00174410    AHS 2001-11-13    21.66 100.00000    1235
441910   441910 00174410    AHS 2001-11-14    21.60  99.72300    1235
458458   458458 00174410    AHS 2001-11-15    21.65  99.95380    1235
284003   284003 00174410    AHS 2001-11-16    21.59  99.67680    1235

Two problems for me:

1)I can't just apply it to the entire column since there will be problems at the boundary points where the ID changes from 1 to another. I need to find out how to specify a restriction on the name of the ID

2) From Java, instinctively I would use a loop to calculate daily returns, but I found out that R is very slow with loops, so I need to find an efficient way to calculate  daily returns with such a huge dataset.

Thanks a lot!

Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

Joshua Wiley-2
Hello Jeff,

Try this:

test <- data.frame(totret=rnorm(10^7)) #create some sample data
test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]

The general idea is to take the column "totret" excluding the first 1,
dividided by "totret" exluding the last row.  This gives in effect t+1
(since t is now shorter)/t

I assigned the result to a new column "dailyreturn".  For 10^7 rows,
it tooks 1.92 seconds on my system.

HTH,

Josh

On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <[hidden email]> wrote:

>
> Hello Everyone,
>
> I just started a new job & it requires heavy use of R to analyze datasets.
>
> I have a data.table that looks like this. It is sorted by ID & Date, there
> are about 150 different IDs & the dataset spans 3 million rows. The main
> columns of concern are ID, date, and totret. What I need to do is to derive
> daily returns for each ID from totret, which is simply totret at time t+1
> divided by totret at time t.
>
>              X       id ticker      date_ adjClose    totret RankStk
> 427225   427225 00174410    AHS 2001-11-13    21.66 100.00000    1235
> 441910   441910 00174410    AHS 2001-11-14    21.60  99.72300    1235
> 458458   458458 00174410    AHS 2001-11-15    21.65  99.95380    1235
> 284003   284003 00174410    AHS 2001-11-16    21.59  99.67680    1235
>
> Two problems for me:
>
> 1)I can't just apply it to the entire column since there will be problems at
> the boundary points where the ID changes from 1 to another. I need to find
> out how to specify a restriction on the name of the ID
>
> 2) From Java, instinctively I would use a loop to calculate daily returns,
> but I found out that R is very slow with loops, so I need to find an
> efficient way to calculate  daily returns with such a huge dataset.
>
> Thanks a lot!
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242633.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

Jeff08

Hey Josh,

Thanks for the quick response!

I guess I have to switch from the Java mindset to the matrix/vector mindset
of R.

Your code worked very well, but I just have one problem:

Essentially I have a time series of stock A, followed by a time series of
stock B, etc.
So there are break points in the data (the points where it switches stocks
have incorrect returns, and should be NA at t=0 for each stock)

Is there an easy way to account for this in R?
--
View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242697.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

Jeff08
In reply to this post by Joshua Wiley-2
Hey Josh,

Thanks for the quick response!

I guess I have to switch from the Java mindset to the matrix/vector mindset of R.

Your code worked very well, but I just have one problem:

Essentially I have a time series of stock A, followed by a time series of stock B, etc.
So there are break points in the data (the points where it switches stocks have incorrect returns, and should be NA at t=0 for each stock)

Is there an easy way to account for this in R? What I was thinking of is if there is a way to make a filter rule. Such as if the ID of the row matches Stock A, then perform this.

>>"Hello Jeff,

Try this:

test <- data.frame(totret=rnorm(10^7)) #create some sample data
test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]

The general idea is to take the column "totret" excluding the first 1,
dividided by "totret" exluding the last row.  This gives in effect t+1
(since t is now shorter)/t

I assigned the result to a new column "dailyreturn".  For 10^7 rows,
it tooks 1.92 seconds on my system."
Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

Joshua Wiley-2
Hey Jeff,

I have a few ideas.  Each has some different requirements, and to help
you choose, I bench marked them.


###START###

##Basic data
> test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3), time=rep(c(1, rep(0, 999)), 10^4))

##Option 1: probably the most general, but also the slowest by far.
##The idea is it does the calculation for each stock/ID, and then
concatenates [c()] an NA in front.

> system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"], test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE)
   user  system elapsed
  49.11    0.42   49.86

##Option 2: Assumes that you have the same number of measurements for
each stock/ID so you can just assign an NA every nth row.
##This is fairly fast

> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.11    0.21    1.31
> system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.39    0.04    0.42

##Option 3: Assumes that you have some variable (time in my little
test data) that somehow indicates when each stock/ID has its first
measurement.  In the example, the first measurement gets a 1 and
subsequent ones a 0.  So we just assign NA in 'dailyreturns' everytime
the other "time" column has a 1.  Again, a big assumption, but fairly
quick.

> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.06    0.17    1.25
> system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.46    0.09    0.55

###END###

I really feel like there should be a faster way that is also more
general, but it is late and I am not coming up with any better ideas
at the moment.  Perhaps somehow finding the first instance of a
stock/ID?  Anyway, this was simulated on 10 million rows, so maybe
by() works plenty fast for you.

Josh


On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <[hidden email]> wrote:

>
> Hey Josh,
>
> Thanks for the quick response!
>
> I guess I have to switch from the Java mindset to the matrix/vector mindset
> of R.
>
> Your code worked very well, but I just have one problem:
>
> Essentially I have a time series of stock A, followed by a time series of
> stock B, etc.
> So there are break points in the data (the points where it switches stocks
> have incorrect returns, and should be NA at t=0 for each stock)
>
> Is there an easy way to account for this in R? What I was thinking of is if
> there is a way to make a filter rule. Such as if the ID of the row matches
> Stock A, then perform this.
>
>>>"Hello Jeff,
>
> Try this:
>
> test <- data.frame(totret=rnorm(10^7)) #create some sample data
> test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]
>
> The general idea is to take the column "totret" excluding the first 1,
> dividided by "totret" exluding the last row.  This gives in effect t+1
> (since t is now shorter)/t
>
> I assigned the result to a new column "dailyreturn".  For 10^7 rows,
> it tooks 1.92 seconds on my system."
> --
> View this message in context: http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242703.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

Joshua Wiley-2
I am not exactly sure how your filtering code is working, but take a look at

?na.omit

You will probably need a few additional steps if you want to remove
all rows related to a particular id.  Also look at ?subset which is a
good general way to subset your data.

Josh



On Thu, Jun 3, 2010 at 11:45 PM, Jeffery Ding <[hidden email]> wrote:

> Thanks, you have been tremendously helpful!
> I will be able to implement option 2, after I filter out stocks with
> incomplete data sets.
>
> So far, for my filtering code I have:
>
> ##Filtering
>
> x<-length(unique(Returns$date_))
> y<-unique(Returns$id)
> Returns.filter<-Returns
>
> i<-1
>
> while(i<=length(y)) {
>     a<-sum(Returns$id==y[i])
>     if(a<x) {
>         ##need code that will remove all rows with id a
>     }
>     i<-i+1
>     }
>
>
>
> On Fri, Jun 4, 2010 at 2:40 PM, Joshua Wiley <[hidden email]> wrote:
>>
>> Hey Jeff,
>>
>> I have a few ideas.  Each has some different requirements, and to help
>> you choose, I bench marked them.
>>
>>
>> ###START###
>>
>> ##Basic data
>> > test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3),
>> > time=rep(c(1, rep(0, 999)), 10^4))
>>
>> ##Option 1: probably the most general, but also the slowest by far.
>> ##The idea is it does the calculation for each stock/ID, and then
>> concatenates [c()] an NA in front.
>>
>> > system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"],
>> > test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE)
>>   user  system elapsed
>>  49.11    0.42   49.86
>>
>> ##Option 2: Assumes that you have the same number of measurements for
>> each stock/ID so you can just assign an NA every nth row.
>> ##This is fairly fast
>>
>> > system.time(test[-1,"dailyreturns"] <-
>> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
>>   user  system elapsed
>>   1.11    0.21    1.31
>> > system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA,
>> > gcFirst=TRUE)
>>   user  system elapsed
>>   0.39    0.04    0.42
>>
>> ##Option 3: Assumes that you have some variable (time in my little
>> test data) that somehow indicates when each stock/ID has its first
>> measurement.  In the example, the first measurement gets a 1 and
>> subsequent ones a 0.  So we just assign NA in 'dailyreturns' everytime
>> the other "time" column has a 1.  Again, a big assumption, but fairly
>> quick.
>>
>> > system.time(test[-1,"dailyreturns"] <-
>> > test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
>>   user  system elapsed
>>   1.06    0.17    1.25
>> > system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA,
>> > gcFirst=TRUE)
>>   user  system elapsed
>>   0.46    0.09    0.55
>>
>> ###END###
>>
>> I really feel like there should be a faster way that is also more
>> general, but it is late and I am not coming up with any better ideas
>> at the moment.  Perhaps somehow finding the first instance of a
>> stock/ID?  Anyway, this was simulated on 10 million rows, so maybe
>> by() works plenty fast for you.
>>
>> Josh
>>
>>
>> On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <[hidden email]> wrote:
>> >
>> > Hey Josh,
>> >
>> > Thanks for the quick response!
>> >
>> > I guess I have to switch from the Java mindset to the matrix/vector
>> > mindset
>> > of R.
>> >
>> > Your code worked very well, but I just have one problem:
>> >
>> > Essentially I have a time series of stock A, followed by a time series
>> > of
>> > stock B, etc.
>> > So there are break points in the data (the points where it switches
>> > stocks
>> > have incorrect returns, and should be NA at t=0 for each stock)
>> >
>> > Is there an easy way to account for this in R? What I was thinking of is
>> > if
>> > there is a way to make a filter rule. Such as if the ID of the row
>> > matches
>> > Stock A, then perform this.
>> >
>> >>>"Hello Jeff,
>> >
>> > Try this:
>> >
>> > test <- data.frame(totret=rnorm(10^7)) #create some sample data
>> > test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]
>> >
>> > The general idea is to take the column "totret" excluding the first 1,
>> > dividided by "totret" exluding the last row.  This gives in effect t+1
>> > (since t is now shorter)/t
>> >
>> > I assigned the result to a new column "dailyreturn".  For 10^7 rows,
>> > it tooks 1.92 seconds on my system."
>> > --
>> > View this message in context:
>> > http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242703.html
>> > Sent from the R help mailing list archive at Nabble.com.
>> >
>> > ______________________________________________
>> > [hidden email] mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Senior in Psychology
>> University of California, Riverside
>> http://www.joshuawiley.com/
>
>
>
> --
> Jeffery Ding
> Duke University, Class of 2012
> (224) 622-3398 | [hidden email]
>



--
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: R Newbie, please help!

djmuseR
In reply to this post by Jeff08
Hi:

The key phrase in your mail was 'data.table'. Given the size of the object,
it is very likely to be a data.table, which (oddly enough) comes from
package
data.table. It is designed to quickly process information in very large
datasets. 3M rows is an 'average' sized data.table :)

Your request isn't very sophisticated - it appears this function should
work groupwise (group = ID):

dret <- function(x) c(100.00, 100 * x[-1]/x[-length(x)])

This function can be processed in data.table or ddply (package plyr)
groupwise without much difficulty. I'm going to assume that the data are
ordered in time for simplicity. I'm also using 100 for the first entry in
the
function - if you want, you can change the initial 100.00 to NA.

Let's generate a little fake data:
id <- as.character(rep(c(427225, 290157, 394025, 382940), each = 1000))
times <- rep(seq(as.Date('2001-11-13'), by = 'days', length = 1000), 4)
totret <- c(rnorm(1000, 20, 0.1), rnorm(1000, 25, 0.1), rnorm(1000, 30,
0.1),
                 rnorm(1000, 35, 0.1))

# data frame:
DF <- data.frame(id = id, times = times, totret = totret)

# data table:
library(data.table)
DR <- data.table(DF)

# data.table sets up id as the table's primary key - note that the storage
mode
# of the key has to be integer.
tables()   # see what we've got

# set id as the table key, do the calculation by group and tack the result
onto DR
system.time({ setkey(DR, id); DR2 <- DR[, dret(totret), by = id];
                      DR$return <- DR2$V1 })
   user  system elapsed
      0       0       0

library(plyr)
system.time(df2 <- ddply(DF, .(id), transform, return = dret(totret)))
   user  system elapsed
   0.03    0.00    0.05

The difference between the two is this. The data.table calculation returns a
data.table DR2 with the key and the returns, after which we add the column
of returns to the original data table DR. In contrast, the ddply calculation
tacks
on the column of returns to the original data frame as a result of
transform.
Notice that in the data.table code, we set the table key (which is often the
most time consuming task, since it orders the data by the values in its
key),
did the calculation and tacked the result onto the original table almost
instantaneously.

According to the data.table package author, the time savings in using
data.table scales upward as the size of the table increases - in other
words,
the bigger the table, the faster data.table will be relative to other
processing
methods currently available in R. You can see that there is a noticeable
time
difference at n = 4000, so the difference at n = 3M will be more dramatic.
Development work in plyr is showing that the gap between it and data.table
is narrowing, but both packages are in active development, so R users can
look forward to two very powerful packages for summarizing, transforming
and condensing data.

I would suggest that you read the vignette and FAQ from data.table
(available
from the on-line data.table help page) and the documentation of plyr
at its author's web site:
http://had.co.nz/plyr/
There is a tutorial with slides and a full-scale document.

HTH,
Dennis


On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <[hidden email]> wrote:

>
> Hello Everyone,
>
> I just started a new job & it requires heavy use of R to analyze datasets.
>
> I have a data.table that looks like this. It is sorted by ID & Date, there
> are about 150 different IDs & the dataset spans 3 million rows. The main
> columns of concern are ID, date, and totret. What I need to do is to derive
> daily returns for each ID from totret, which is simply totret at time t+1
> divided by totret at time t.
>
>              X       id ticker      date_ adjClose    totret RankStk
> 427225   427225 00174410    AHS 2001-11-13    21.66 100.00000    1235
> 441910   441910 00174410    AHS 2001-11-14    21.60  99.72300    1235
> 458458   458458 00174410    AHS 2001-11-15    21.65  99.95380    1235
> 284003   284003 00174410    AHS 2001-11-16    21.59  99.67680    1235
>
> Two problems for me:
>
> 1)I can't just apply it to the entire column since there will be problems
> at
> the boundary points where the ID changes from 1 to another. I need to find
> out how to specify a restriction on the name of the ID
>
> 2) From Java, instinctively I would use a loop to calculate daily returns,
> but I found out that R is very slow with loops, so I need to find an
> efficient way to calculate  daily returns with such a huge dataset.
>
> Thanks a lot!
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242633.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.