index instead of loop?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

index instead of loop?

Ben quant
Hello,

Does anyone know of a way I can speed this up? Basically I'm attempting to
get the data item on the same row as the report date for each report date
available. In reality, I have over 11k of columns, not just A, B, C, D and
I have to do that over 100 times. My solution is slow, but it works. The
loop is slow because of merge.

# create sample data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 =
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",

"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################# change anything below.

# My first attempt at this
fix = function(x)
{
  year = substring(x, 1, 4);
  mo = substring(x, 5, 6);
  day = substring(x, 7, 8);
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
"day")
dt = sapply(dt1, as.character)

fin = dt
ck_rows = length(dt)
bad = character(0)
start_t_all = Sys.time()
for(cn in 1:ncol(ua)){
  uac = ua[,cn]
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
  xx = as.character(xx1[,2])
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin  = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}

colnames(fin) = c('daily_dates',nms)

print("over all time for loop")
print(Sys.time()-start_t_all)

print(fin)


Thanks,

Ben

PS - the real/over-all issue is below, but it is probably too involved to
follow.

On Sat, Mar 3, 2012 at 2:30 PM, Ben quant <[hidden email]> wrote:

> Hello,
>
> Thank you for your help/advice!
>
> The issue here is speed/efficiency. I can do what I want, but its really
> slow.
>
> The goal is to have the ability to do calculations on my data and have it
> adjusted for look-ahead. I see two ways to do this:
> (I'm open to more ideas. My terminology: Unadjusted = values not adjusted
> for look-ahead bias; adjusted = values adjusted for look-ahead bias.)
>
> 1) I could a) do calculations on unadjusted values then b) adjust the
> resulting values for look-ahead bias. Here is what I mean:
>  a) I could say the following using time series of val1: [(val1 - val1 4
> periods ago) / val1 4 periods ago] = resultval. ("Periods" correspond to
> the z.dates in my example below.)
> b) Then I would adjust the resultval for look-ahead based on val1's
> associated report date.
> Note: I don't think this will be the fastest.
>
> 2) I could do the same calculation [(val1 - val1 4 periods ago) / val1 4
> periods ago] = resultval, but my calculation function would get the 'right'
> values that would have no look-ahead bias. I'm not sure how I would do
> this, but maybe a query starting with the date that I want, indexed to
> appropriate report date indexed to the correct value to return. But how do
> I do this in R? I think I would have to put this in our database and do a
> query. The data comes to me in RData format. I could put it all in our
> database via PpgSQL which we already use.
> Note: I think this will be fastest.
>
> Anyway, my first attempt at this was to solve part b of #1 above. Here is
> how my data looks and my first attempt at solving part b of idea #1 above.
> It only takes 0.14 seconds for my mock data, but that is way too slow. The
> major things slowing it down A) the loop, B) the merge statement.
>
> # mock data: this is how it comes to me (raw)
> # in practice I have over 10,000 columns
>
> # the starting 'periods' for my data
> z.dates =
> c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
>
> nms = c("A","B","C","D")
> # these are the report dates that are the real days the data was available
> rd1 =
> matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",
>
> "20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
>               "20070426","--","--","--","--","--","--","20090319",
>               "--","--","--","--","--","--","--","--"),
>             nrow=8,ncol=4)
> dimnames(rd1) = list(z.dates,nms)
>
> # this is the unadjusted raw data, that always has the same dimensions,
> rownames, and colnames as the report dates
> ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
>
> 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
>               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
>               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
>             nrow=8,ncol=4)
> dimnames(ua) = list(z.dates,nms)
>
> ################################# change anything below. I can't change
> anything above this line.
>
> # My first attempt at this was to solve part b of #1 above.
> fix = function(x)
> {
>   year = substring(x, 1, 4);
>   mo = substring(x, 5, 6);
>   day = substring(x, 7, 8);
>   ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
> }
>
> rd = apply(rd1, 2, fix)
> dimnames(rd) = dimnames(rd)
>
> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
> "day")
> dt = sapply(dt1, as.character)
>
> fin = dt
> ck_rows = length(dt)
> bad = character(0)
> start_t_all = Sys.time()
> for(cn in 1:ncol(ua)){
>   uac = ua[,cn]
>   tkr = colnames(ua)[cn]
>   rdc = rd[,cn]
>   ua_rd = cbind(uac,rdc)
>   colnames(ua_rd) = c(tkr,'rt_date')
>   xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
>   xx = as.character(xx1[,2])
>   values <- c(NA, xx[!is.na(xx)])
>   ind = cumsum(!is.na(xx)) + 1
>   y <- values[ind]
>   if(ck_rows == length(y)){
>     fin  = data.frame(fin,y)
>   }else{
>     bad = c(bad,tkr)
>   }
> }
>
> colnames(fin) = c('daily_dates',nms)
>
> # after this I would slice and dice the data into weekly, monthly, etc.
> periodicity as needed, but this leaves it in daily format which is as
> granular as I will get.
>
> print("over all time for loop")
> print(Sys.time()-start_t_all)
>
> Regards,
>
> Ben
>
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Rui Barradas
Hello,

>
> Mar 05, 2012; 8:53pm — by Ben quant Ben quant
> Hello,
>
> Does anyone know of a way I can speed this up?
>

Maybe, let's see.

>
> ################################# change anything below.
>

# Yes.
# First, start by using dates, not characters

fdate <- function(x, format="%Y%m%d"){
        DF <- data.frame(x)
        for(i in colnames(DF)){
                DF[, i] <- as.Date(DF[, i], format=format)
                class(DF[, i]) <- "Date"
        }
        DF
}

rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by = "day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)

# And vectorize
for(tkr in 1:ncol(ua)){
        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
        inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
        sapply(1:length(inxlist), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))

# Check results
str(fin)
str(fin1)
head(fin)
head(fin1)
tail(fin)
tail(fin1)


Note that 'fin' has facotrs, 'fin1' numerics.
I haven't timed it but I believe it should be faster.

Hope this helps,

Rui Barradas



Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
Just looking at this, but it looks like ix doesn't exist:
       sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
fin1[ix[[i]], tkr
+ 1] <<- ua[i, tkr])

 Trying to sort it out now.

Ben

On Mon, Mar 5, 2012 at 7:48 PM, Rui Barradas <[hidden email]> wrote:

> Hello,
>
> >
> > Mar 05, 2012; 8:53pm — by Ben quant Ben quant
> > Hello,
> >
> > Does anyone know of a way I can speed this up?
> >
>
> Maybe, let's see.
>
> >
> > ################################# change anything below.
> >
>
> # Yes.
> # First, start by using dates, not characters
>
> fdate <- function(x, format="%Y%m%d"){
>        DF <- data.frame(x)
>        for(i in colnames(DF)){
>                DF[, i] <- as.Date(DF[, i], format=format)
>                class(DF[, i]) <- "Date"
>        }
>        DF
> }
>
> rd1 <- fdate(rd1)
> # This is yours, use it.
> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
> "day")
> # Set up the result, no time expensive 'cbind' inside a loop
> fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
> fin1[, 1] <- dt1
> nr <- nrow(rd1)
>
> # And vectorize
> for(tkr in 1:ncol(ua)){
>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>        inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i +
> 1]))
>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> fin1[ix[[i]], tkr
> + 1] <<- ua[i, tkr])
> }
> colnames(fin1) <- c("daily_dates", colnames(ua))
>
> # Check results
> str(fin)
> str(fin1)
> head(fin)
> head(fin1)
> tail(fin)
> tail(fin1)
>
>
> Note that 'fin' has facotrs, 'fin1' numerics.
> I haven't timed it but I believe it should be faster.
>
> Hope this helps,
>
> Rui Barradas
>
>
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
I think this is what you meant:

z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
"20080514",  "20080814",  "20081114",  "20090217",
               "20070410","20070709","20071009",   "20080109",
"20080407",  "20080708",  "20081007",  "20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

##############################


fdate <- function(x, format="%Y%m%d"){
  DF <- data.frame(x)
  for(i in colnames(DF)){
    DF[, i] <- as.Date(DF[, i], format=format)
    class(DF[, i]) <- "Date"
  }
  DF
}

rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
  "day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)

# And vectorize
for(tkr in 1:ncol(ua)){
  x  <- c(rd1[, tkr], as.Date("9999-12-31"))
 # inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
  ix <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
  sapply(1:length(ix), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr +
1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))

# Check results
str(fin1)
head(fin1)
tail(fin1)

On Tue, Mar 6, 2012 at 7:34 AM, Ben quant <[hidden email]> wrote:

> Just looking at this, but it looks like ix doesn't exist:
>
>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> fin1[ix[[i]], tkr
> + 1] <<- ua[i, tkr])
>
>  Trying to sort it out now.
>
> Ben
>
>
> On Mon, Mar 5, 2012 at 7:48 PM, Rui Barradas <[hidden email]> wrote:
>
>> Hello,
>>
>> >
>> > Mar 05, 2012; 8:53pm — by Ben quant Ben quant
>> > Hello,
>> >
>> > Does anyone know of a way I can speed this up?
>> >
>>
>> Maybe, let's see.
>>
>> >
>> > ################################# change anything below.
>> >
>>
>> # Yes.
>> # First, start by using dates, not characters
>>
>> fdate <- function(x, format="%Y%m%d"){
>>        DF <- data.frame(x)
>>        for(i in colnames(DF)){
>>                DF[, i] <- as.Date(DF[, i], format=format)
>>                class(DF[, i]) <- "Date"
>>        }
>>        DF
>> }
>>
>> rd1 <- fdate(rd1)
>> # This is yours, use it.
>> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
>> "day")
>> # Set up the result, no time expensive 'cbind' inside a loop
>> fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
>> fin1[, 1] <- dt1
>> nr <- nrow(rd1)
>>
>> # And vectorize
>> for(tkr in 1:ncol(ua)){
>>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>>        inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i
>> + 1]))
>>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
>> fin1[ix[[i]], tkr
>> + 1] <<- ua[i, tkr])
>> }
>> colnames(fin1) <- c("daily_dates", colnames(ua))
>>
>> # Check results
>> str(fin)
>> str(fin1)
>> head(fin)
>> head(fin1)
>> tail(fin)
>> tail(fin1)
>>
>>
>> Note that 'fin' has facotrs, 'fin1' numerics.
>> I haven't timed it but I believe it should be faster.
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Rui Barradas
In reply to this post by Ben quant
Hello,

> Just looking at this, but it looks like ix doesn't exist:
>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
>
>  Trying to sort it out now.

Right, sorry.
I've changed the name from 'ix' to 'inxlist' to make it more readable just before posting.
And since the object 'ix' still existed in the R global environment it didn't throw an error...

Your correction in the post that followed is what I meant.

Correction (full loop, tested):

for(tkr in 1:ncol(ua)){
        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
        ix <- lapply(1:nr, function(i)
                        which(x[i] <= dt1 & dt1 < x[i + 1]))
        sapply(1:length(ix), function(i)
                if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
}

Rui Barradas
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
In reply to this post by Ben quant
Unfortunately, your solution is does not scale well.  (Tough for you to
test this without my real data.)  If ua is my data and rd1 are my report
dates (same as the code below) and I use more columns, it appears that your
solution slows considerably. Remember I have ~11k columns in my real data,
so scalability is critical.

Here are the processing times using real data:

Use 4 columns:
ua = ua[,1:4]
rd1 = rd1[,1:4]
mine: 2.4 sec's
yours: 1.39 sec's   Note: yours is faster with 4 columns (like the mockup
data I provided.)

Use 150 columns:
ua = ua[,1:150]
rd1 = rd1[,1:150]
mine: 5 sec's
yours: 9 sec's

Use 300 columns:
ua = ua[,1:300]
rd1 = rd1[,1:300]
mine: 9.5 sec's
yours: 1 min


##################### data
Here is the mockup date and code used: (Anyone looking to test the
scalability may want to add more columns.)

Mockup date:
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
"20080514",  "20080814",  "20081114",  "20090217",
               "20070410","20070709","20071009",   "20080109",
"20080407",  "20080708",  "20081007",  "20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)


######################## My code:

start_t_all = Sys.time()
nms = colnames(ua)

fix = function(x)
{
  year = substring(x, 1, 4);
  mo = substring(x, 5, 6);
  day = substring(x, 7, 8);
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

dt1 <- seq(from =as.Date(z.dates[1]), to =
as.Date(z.dates[length(z.dates)]), by =
  "day")
dt = sapply(dt1, as.character)

fin = dt
ck_rows = length(dt)
bad = character(0)

for(cn in 1:ncol(ua)){
  uac = ua[,cn]
  tkr = colnames(ua)[cn]
  rdc = rd[,cn]
  ua_rd = cbind(uac,rdc)
  colnames(ua_rd) = c(tkr,'rt_date')
  xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)
  xx = as.character(xx1[,2])
  values <- c(NA, xx[!is.na(xx)])
  ind = cumsum(!is.na(xx)) + 1
  y <- values[ind]
  if(ck_rows == length(y)){
    fin  = data.frame(fin,y)
  }else{
    bad = c(bad,tkr)
  }
}
if(length(bad)){
  nms = nms[bad != nms]
}
colnames(fin) = c('daily_dates',nms)

print("over all time for loop")
print(Sys.time()-start_t_all)


 ################################### Your code:


z.dates = rownames(ua)

start_t_all = Sys.time()
fdate <- function(x, format="%Y%m%d"){
  DF <- data.frame(x)
  for(i in colnames(DF)){
    DF[, i] <- as.Date(DF[, i], format=format)
    class(DF[, i]) <- "Date"
  }
  DF
}

rd1 <- fdate(rd1)
# This is yours, use it.
dt1 <- seq(from =as.Date(z.dates[1]), to =
as.Date(z.dates[length(z.dates)]), by ="day")
# Set up the result, no time expensive 'cbind' inside a loop
fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
fin1[, 1] <- dt1
nr <- nrow(rd1)

# And vectorize
for(tkr in 1:ncol(ua)){
  x  <- c(rd1[, tkr], as.Date("9999-12-31"))
  # inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
  ix <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
  sapply(1:length(ix), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr +
1] <<- ua[i, tkr])
}
colnames(fin1) <- c("daily_dates", colnames(ua))
print(Sys.time()-start_t_all)


Thanks for your efforts though,

ben

On Tue, Mar 6, 2012 at 7:39 AM, Ben quant <[hidden email]> wrote:

> I think this is what you meant:
>
>
> z.dates =
> c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
>
> nms = c("A","B","C","D")
> # these are the report dates that are the real days the data was available
> rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
> "20080514",  "20080814",  "20081114",  "20090217",
>                "20070410","20070709","20071009",   "20080109",
> "20080407",  "20080708",  "20081007",  "20090112",
>                "20070426","--","--","--","--","--","--","20090319",
>                "--","--","--","--","--","--","--","--"),
>              nrow=8,ncol=4)
> dimnames(rd1) = list(z.dates,nms)
>
> # this is the unadjusted raw data, that always has the same dimensions,
> rownames, and colnames as the report dates
> ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
>
> 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
>               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
>               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
>             nrow=8,ncol=4)
> dimnames(ua) = list(z.dates,nms)
>
> ##############################
>
>
> fdate <- function(x, format="%Y%m%d"){
>   DF <- data.frame(x)
>   for(i in colnames(DF)){
>     DF[, i] <- as.Date(DF[, i], format=format)
>     class(DF[, i]) <- "Date"
>   }
>   DF
> }
>
> rd1 <- fdate(rd1)
> # This is yours, use it.
> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
>   "day")
> # Set up the result, no time expensive 'cbind' inside a loop
> fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
> fin1[, 1] <- dt1
> nr <- nrow(rd1)
>
> # And vectorize
> for(tkr in 1:ncol(ua)){
>   x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>  # inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
>   ix <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i + 1]))
>   sapply(1:length(ix), function(i) if(length(ix[[i]])) fin1[ix[[i]], tkr +
> 1] <<- ua[i, tkr])
>
> }
> colnames(fin1) <- c("daily_dates", colnames(ua))
>
> # Check results
> str(fin1)
> head(fin1)
> tail(fin1)
>
>
> On Tue, Mar 6, 2012 at 7:34 AM, Ben quant <[hidden email]> wrote:
>
>> Just looking at this, but it looks like ix doesn't exist:
>>
>>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
>> fin1[ix[[i]], tkr
>> + 1] <<- ua[i, tkr])
>>
>>  Trying to sort it out now.
>>
>> Ben
>>
>>
>> On Mon, Mar 5, 2012 at 7:48 PM, Rui Barradas <[hidden email]> wrote:
>>
>>> Hello,
>>>
>>> >
>>> > Mar 05, 2012; 8:53pm — by Ben quant Ben quant
>>> > Hello,
>>> >
>>> > Does anyone know of a way I can speed this up?
>>> >
>>>
>>> Maybe, let's see.
>>>
>>> >
>>> > ################################# change anything below.
>>> >
>>>
>>> # Yes.
>>> # First, start by using dates, not characters
>>>
>>> fdate <- function(x, format="%Y%m%d"){
>>>        DF <- data.frame(x)
>>>        for(i in colnames(DF)){
>>>                DF[, i] <- as.Date(DF[, i], format=format)
>>>                class(DF[, i]) <- "Date"
>>>        }
>>>        DF
>>> }
>>>
>>> rd1 <- fdate(rd1)
>>> # This is yours, use it.
>>> dt1 <- seq(from =as.Date(z.dates[1]), to = as.Date("2009-03-25"), by =
>>> "day")
>>> # Set up the result, no time expensive 'cbind' inside a loop
>>> fin1 <- data.frame(matrix(NA, nrow=length(dt1), ncol=ncol(ua) + 1))
>>> fin1[, 1] <- dt1
>>> nr <- nrow(rd1)
>>>
>>> # And vectorize
>>> for(tkr in 1:ncol(ua)){
>>>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>>>        inxlist <- lapply(1:nr, function(i) which(x[i] <= dt1 & dt1 < x[i
>>> + 1]))
>>>        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
>>> fin1[ix[[i]], tkr
>>> + 1] <<- ua[i, tkr])
>>> }
>>> colnames(fin1) <- c("daily_dates", colnames(ua))
>>>
>>> # Check results
>>> str(fin)
>>> str(fin1)
>>> head(fin)
>>> head(fin1)
>>> tail(fin)
>>> tail(fin1)
>>>
>>>
>>> Note that 'fin' has facotrs, 'fin1' numerics.
>>> I haven't timed it but I believe it should be faster.
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4448567.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>
        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
In reply to this post by Rui Barradas
Hello,

In case anyone is interested in a faster solution for lots of columns. This
solution is slower if you only have a few columns.  If anyone has anything
faster, I would be interested in seeing it.

### some mockup data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D") # add more columns to see how the code below is
fsater
# these are the report dates that are the real days the data was available,
so show the data the day after this date ('after' is a design decision)
rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
"20080514",  "20080814",  "20081114",  "20090217",
               "20070410","20070709","20071009",   "20080109",
"20080407",  "20080708",  "20081007",  "20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################ the fastest code I have found:

start_t_all = Sys.time()
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
#wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))

for(i in 1:length(wd)){
  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_col_idx = which(rd[rd_row_idx,] < d)

  if((rd_row_idx - 1) > 0){
    mat[i,] = ua[rd_row_idx - 1,]
  }
  if( length(rd_col_idx)){
    mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
  }
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)

Regards,

Ben

On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas <[hidden email]> wrote:

> Hello,
>
> > Just looking at this, but it looks like ix doesn't exist:
> >        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> > fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> >
> >  Trying to sort it out now.
>
> Right, sorry.
> I've changed the name from 'ix' to 'inxlist' to make it more readable just
> before posting.
> And since the object 'ix' still existed in the R global environment it
> didn't throw an error...
>
> Your correction in the post that followed is what I meant.
>
> Correction (full loop, tested):
>
> for(tkr in 1:ncol(ua)){
>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>         ix <- lapply(1:nr, function(i)
>                         which(x[i] <= dt1 & dt1 < x[i + 1]))
>         sapply(1:length(ix), function(i)
>                 if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> }
>
> Rui Barradas
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Rui Barradas
Hello again.

Ben quant wrote
Hello,

In case anyone is interested in a faster solution for lots of columns. This
solution is slower if you only have a few columns.  If anyone has anything
faster, I would be interested in seeing it.

### some mockup data
z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D") # add more columns to see how the code below is
fsater
# these are the report dates that are the real days the data was available,
so show the data the day after this date ('after' is a design decision)
rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
"20080514",  "20080814",  "20081114",  "20090217",
               "20070410","20070709","20071009",   "20080109",
"20080407",  "20080708",  "20081007",  "20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

################################ the fastest code I have found:

start_t_all = Sys.time()
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
}

rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
#wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))

for(i in 1:length(wd)){
  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_col_idx = which(rd[rd_row_idx,] < d)

  if((rd_row_idx - 1) > 0){
    mat[i,] = ua[rd_row_idx - 1,]
  }
  if( length(rd_col_idx)){
    mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
  }
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)

Regards,

Ben

On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas <[hidden email]> wrote:

> Hello,
>
> > Just looking at this, but it looks like ix doesn't exist:
> >        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> > fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> >
> >  Trying to sort it out now.
>
> Right, sorry.
> I've changed the name from 'ix' to 'inxlist' to make it more readable just
> before posting.
> And since the object 'ix' still existed in the R global environment it
> didn't throw an error...
>
> Your correction in the post that followed is what I meant.
>
> Correction (full loop, tested):
>
> for(tkr in 1:ncol(ua)){
>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
>         ix <- lapply(1:nr, function(i)
>                         which(x[i] <= dt1 & dt1 < x[i + 1]))
>         sapply(1:length(ix), function(i)
>                 if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> }
>
> Rui Barradas
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Maybe I'm not understanding the problem very well, but let me describe what I'm thinking it is.

You have two tables, 'rd1' and 'ua' and a vector of dates, 'z.dates'.  The result is a table such that:
1. From 'z.dates' make a vector of daily dates.
2. Each column is filled with numbers from 'ua' based on dates in 'rd1', starting at the day given in step 1.
My doubt is that your last posted code seems to give a special role to column 'A'.

> mat[225:232, ]
                               A       B              C      D
2007-11-10 636.16 2645     NaN NaN
2007-11-11 636.16 2645     NaN NaN
2007-11-12 636.16 2645     NaN NaN
2007-11-13 636.16 2645     NaN NaN
2007-11-14 636.16 2645     NaN NaN
2007-11-15 655.91 2645 -98.426 NaN
2007-11-16 655.91 2645 -98.426 NaN
2007-11-17 655.91 2645 -98.426 NaN

The values in column 'C' change following the date in column 'A'. That is the third date in 'rd1',
more exactly, rd1[3, 1] == "20071115".

Shouldn't the values in mat[, "C"] start at 2009-03-20? The corresponding value in 'ua' would then be 144.138.

(I still believe this can be made much faster.)

Rui Barradas
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
Humm.... If I understand what you are saying, you are correct. I get
144.138 for 2009-03-20 for column C. Maybe I posted the wrong code?  If so,
sorry.  Let me know if you disagree. I still plan to come back to this and
optimize it more, so if you see anything that would make it faster that
would be great. Of course, the for loop is my focus for optimization. Due
to some issues in the real data I had to add the lag and lag2 stuff in (I
don't think I had that before). In my real data the values don't really
belong in the z.dates the are aligned with, but to avoid lots of empty
values in the flat matrix (ua) they were forced in. I can push them into
their "real" dates via looking at a deeper lag. I'm thinking that all the
"which" stuff in the for look can be nested so that it runs faster. Also
the as.Date, abs() and max(which( etc. stuff seems like it could be handled
better/faster or outside the loop.

If you are interested in helping further, I can post a link to some 'real'
data.

Here is what I am using now and it seems to work.  Sorry, my code is still
very fluid:

z.dates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rd1 =
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",

"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rd1) = list(z.dates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(ua) = list(z.dates,nms)

z.dates = rownames(ua)
############################## by rows
##########################################  FASTEST

start_t_all = Sys.time()
fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))

}
rd = apply(rd1, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
rownames(mat) = wd
nms = as.Date(rownames(ua))

for(i in 1:length(wd)){
  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
  rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d")  <
d)
  rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,], format="%Y-%m-%d")
< d)

  if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){

    mat[i,rd_col_idx_lag2] = ua[rd_row_idx - 2,rd_col_idx_lag2]
  }
  if(length(rd_col_idx_lag)){
    mat[i,rd_col_idx_lag] = ua[rd_row_idx - 1,rd_col_idx_lag]
  }
  if( length(rd_col_idx)){
    mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
  }
}
colnames(mat)=colnames(ua)
print(Sys.time()-start_t_all)


Let me know if you disagree,

Ben

On Wed, Mar 7, 2012 at 5:57 PM, Rui Barradas <[hidden email]> wrote:

> Hello again.
>
>
> Ben quant wrote
> >
> > Hello,
> >
> > In case anyone is interested in a faster solution for lots of columns.
> > This
> > solution is slower if you only have a few columns.  If anyone has
> anything
> > faster, I would be interested in seeing it.
> >
> > ### some mockup data
> > z.dates =
> >
> c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")
> >
> > nms = c("A","B","C","D") # add more columns to see how the code below is
> > fsater
> > # these are the report dates that are the real days the data was
> > available,
> > so show the data the day after this date ('after' is a design decision)
> > rd1 = matrix(c("20070514","20070814","20071115",   "20080213",
> > "20080514",  "20080814",  "20081114",  "20090217",
> >                "20070410","20070709","20071009",   "20080109",
> > "20080407",  "20080708",  "20081007",  "20090112",
> >                "20070426","--","--","--","--","--","--","20090319",
> >                "--","--","--","--","--","--","--","--"),
> >              nrow=8,ncol=4)
> > dimnames(rd1) = list(z.dates,nms)
> >
> > # this is the unadjusted raw data, that always has the same dimensions,
> > rownames, and colnames as the report dates
> > ua = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,
> >
> > 2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
> >               NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
> >               NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
> >             nrow=8,ncol=4)
> > dimnames(ua) = list(z.dates,nms)
> >
> > ################################ the fastest code I have found:
> >
> > start_t_all = Sys.time()
> > fix = function(x)
> > {
> >   year = substring(x, 1, 4)
> >   mo = substring(x, 5, 6)
> >   day = substring(x, 7, 8)
> >   ifelse(year=="--", "NA", paste(year, mo, day, sep = "-"))
> > }
> >
> > rd = apply(rd1, 2, fix)
> > dimnames(rd) = dimnames(rd)
> >
> > wd1 <- seq(from =as.Date(min(z.dates)), to = Sys.Date(), by = "day")
> > #wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
> > wd = sapply(wd1, as.character)
> >
> > mat = matrix(NA,nrow=length(wd),ncol=ncol(ua))
> > rownames(mat) = wd
> > nms = as.Date(rownames(ua))
> >
> > for(i in 1:length(wd)){
> >   d = as.Date(wd[i])
> >   diff = abs(nms - d)
> >   rd_row_idx = max(which(diff == min(diff)))
> >   rd_col_idx = which(rd[rd_row_idx,] < d)
> >
> >   if((rd_row_idx - 1) > 0){
> >     mat[i,] = ua[rd_row_idx - 1,]
> >   }
> >   if( length(rd_col_idx)){
> >     mat[i,rd_col_idx] = ua[rd_row_idx,rd_col_idx]
> >   }
> > }
> > colnames(mat)=colnames(ua)
> > print(Sys.time()-start_t_all)
> >
> > Regards,
> >
> > Ben
> >
> > On Tue, Mar 6, 2012 at 8:22 AM, Rui Barradas &lt;rui1174@&gt; wrote:
> >
> >> Hello,
> >>
> >> > Just looking at this, but it looks like ix doesn't exist:
> >> >        sapply(1:length(inxlist), function(i) if(length(ix[[i]]))
> >> > fin1[ix[[i]], tkr + 1] <<- ua[i, tkr])
> >> >
> >> >  Trying to sort it out now.
> >>
> >> Right, sorry.
> >> I've changed the name from 'ix' to 'inxlist' to make it more readable
> >> just
> >> before posting.
> >> And since the object 'ix' still existed in the R global environment it
> >> didn't throw an error...
> >>
> >> Your correction in the post that followed is what I meant.
> >>
> >> Correction (full loop, tested):
> >>
> >> for(tkr in 1:ncol(ua)){
> >>        x  <- c(rd1[, tkr], as.Date("9999-12-31"))
> >>         ix <- lapply(1:nr, function(i)
> >>                         which(x[i] <= dt1 & dt1 < x[i + 1]))
> >>         sapply(1:length(ix), function(i)
> >>                 if(length(ix[[i]])) fin1[ix[[i]], tkr + 1] <<- ua[i,
> >> tkr])
> >> }
> >>
> >> Rui Barradas
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4450186.html
> >> Sent from the R help mailing list archive at Nabble.com.
> >>
> >> ______________________________________________
> >> R-help@ mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@ mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> Maybe I'm not understanding the problem very well, but let me describe what
> I'm thinking it is.
>
> You have two tables, 'rd1' and 'ua' and a vector of dates, 'z.dates'.  The
> result is a table such that:
> 1. From 'z.dates' make a vector of daily dates.
> 2. Each column is filled with numbers from 'ua' based on dates in 'rd1',
> starting at the day given in step 1.
> My doubt is that your last posted code seems to give a special role to
> column 'A'.
>
> > mat[225:232, ]
>                               A       B              C      D
> 2007-11-10 636.16 2645     NaN NaN
> 2007-11-11 636.16 2645     NaN NaN
> 2007-11-12 636.16 2645     NaN NaN
> 2007-11-13 636.16 2645     NaN NaN
> 2007-11-14 636.16 2645     NaN NaN
> 2007-11-15 655.91 2645 -98.426 NaN
> 2007-11-16 655.91 2645 -98.426 NaN
> 2007-11-17 655.91 2645 -98.426 NaN
>
> The values in column 'C' change following the date in column 'A'. That is
> the third date in 'rd1',
> more exactly, rd1[3, 1] == "20071115".
>
> Shouldn't the values in mat[, "C"] start at 2009-03-20? The corresponding
> value in 'ua' would then be 144.138.
>
> (I still believe this can be made much faster.)
>
> Rui Barradas
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4455223.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Rui Barradas
Hello,

> Humm.... If I understand what you are saying, you are correct. I get
> 144.138 for 2009-03-20 for column C. Maybe I posted the wrong code?  If so,
> sorry.

I think I have the fastest so far solution, and it checks with your corrected,last one.

I've made just a change: to transform it into a function I renamed the parameters
(only for use inside the function) 'zdates', without the period, 'rddata' and 'uadata'.

'fun1' is yours, 'fun2', mine. Here it goes.


fun1 <- function(zdates, rddata, uadata){
    fix = function(x)
    {
      year = substring(x, 1, 4)
      mo = substring(x, 5, 6)
      day = substring(x, 7, 8)
      ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))

    }
    rd = apply(rddata, 2, fix)
    dimnames(rd) = dimnames(rd)

    wd1 <- seq(from =as.Date(min(zdates)), to = Sys.Date(), by = "day")
    #wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
    wd = sapply(wd1, as.character)
    mat = matrix(NA,nrow=length(wd),ncol=ncol(uadata))
    rownames(mat) = wd
    nms = as.Date(rownames(uadata))

    for(i in 1:length(wd)){
      d = as.Date(wd[i])
      diff = abs(nms - d)
      rd_row_idx = max(which(diff == min(diff)))
      rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
      rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d")  < d)
      rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,], format="%Y-%m-%d")  < d)

      if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){
        mat[i,rd_col_idx_lag2] = uadata[rd_row_idx - 2,rd_col_idx_lag2]
      }
      if(length(rd_col_idx_lag)){
        mat[i,rd_col_idx_lag] = uadata[rd_row_idx - 1,rd_col_idx_lag]
      }
      if( length(rd_col_idx)){
        mat[i,rd_col_idx] = uadata[rd_row_idx,rd_col_idx]
      }
    }
    colnames(mat)=colnames(uadata)
    mat
}

fun2 <- function(zdates, rddata, uadata){

        fdate <- function(x, format="%Y%m%d"){
                DF <- data.frame(x)
                for(i in colnames(DF)){
                        DF[, i] <- as.Date(DF[, i], format=format)
                        class(DF[, i]) <- "Date"
                }
                DF
        }

        rddata <- fdate(rddata)
        wd1 <- seq(from = as.Date(zdates[1]), to = Sys.Date(), by = "day")
        nwd1 <- length(wd1)

        fin1 <- matrix(NA, nrow=length(wd1), ncol=ncol(uadata))
        nr <- nrow(rddata)
        xstart <- c(integer(nr), nwd1)
        for(j in 1:ncol(uadata)){
                x <- xstart
                for(i in 1:nr)
                        x[i] <- if(!is.na(rddata[i, j]) & !is.nan(rddata[i, j]))
                                        which(wd1 == rddata[i, j])
                                        else NA
                ix <- which(!is.na(x))
                for(i in seq_len(length(ix) - 1)){
                        from <- x[ ix[i] ] + 1
                        to   <- x[ ix[i + 1] ]
                        fin1[ from:to, j ] <- uadata[ ix[i], j ]
                }
        }
        colnames(fin1) <- colnames(uadata)
        rownames(fin1) <- as.character(wd1)
        fin1
}

t1 <- system.time(m1 <- fun1(z.dates, rd1, ua))
t2 <- system.time(m2 <- fun2(z.dates, rd1, ua))

all.equal(m1, m2)
[1] TRUE

rbind(t1, t2)
   user.self sys.self elapsed user.child sys.child
t1      1.50        0    1.50         NA        NA
t2      0.02        0    0.01         NA        NA

And the better news is that I believe it scales up without degrading performance,
like my first did.

See if it works.

Rui Barradas

Reply | Threaded
Open this post in threaded view
|

Re: index instead of loop?

Ben quant
Here is my latest. I kind of changed the problem (for speed). In real life
I have over 300 uadata type matrices, each having over 20 rows and over
11,000 columns. However the rddata file is valid for all of the uadata
matrices that I have (300). What I am doing now: I'm creating a matrix of
row indices which will either lag the row values or not based on the report
data (rddata). Then I apply that matrix of row indices to each uadata data
item (300 times) to create a matrix of the correctly row adjusted data
items for the correct columns of the dimensions and periodicity that I want
(weekly in this case). The key being, I only do the 'adjustment' once
(which is comparatively slow) and I apply those results to the data matrix
(fast!).

I'm open to ideas. I put this together quickly so hopefully all is well.

#########sample data
zdates =
c("2007-03-31","2007-06-30","2007-09-30","2007-12-31","2008-03-31","2008-06-30","2008-09-30","2008-12-31")

nms = c("A","B","C","D")
# these are the report dates that are the real days the data was available
rddata =
matrix(c("20070514","20070814","20071115","20080213","20080514","20080814","20081114","20090217",

"20070410","20070709","20071009","20080109","20080407","20080708","20081007","20090112",
               "20070426","--","--","--","--","--","--","20090319",
               "--","--","--","--","--","--","--","--"),
             nrow=8,ncol=4)
dimnames(rddata) = list(zdates,nms)

# this is the unadjusted raw data, that always has the same dimensions,
rownames, and colnames as the report dates
uadata = matrix(c(640.35,636.16,655.91,657.41,682.06,702.90,736.15,667.65,

2625.050,2625.050,2645.000,2302.000,1972.000,1805.000,1547.000,1025.000,
              NaN, NaN,-98.426,190.304,180.894,183.220,172.520, 144.138,
              NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN),
            nrow=8,ncol=4)
dimnames(uadata) = list(zdates,nms)

#################################################### I do this once

fix = function(x)
{
  year = substring(x, 1, 4)
  mo = substring(x, 5, 6)
  day = substring(x, 7, 8)
  ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))

}
rd = apply(rddata, 2, fix)
dimnames(rd) = dimnames(rd)

wd1 <- seq(from =as.Date(min(zdates)), to = Sys.Date(), by = "day")
wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
wd = sapply(wd1, as.character)

mat = matrix(NA,nrow=length(wd),ncol=ncol(uadata))
rownames(mat) = wd
nms = as.Date(rownames(uadata))

for(i in 1:length(wd)){

  d = as.Date(wd[i])
  diff = abs(nms - d)
  rd_row_idx = max(which(diff == min(diff)))
  rd_row_idx_lag = rd_row_idx - 1
  rd_row_idx_lag2 = rd_row_idx - 2
  rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
  rd_col_idx_lag = which(as.Date(rd[rd_row_idx_lag,], format="%Y-%m-%d")  <
d)
  rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx_lag2,], format="%Y-%m-%d")
< d)

  ## if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){
  if(rd_row_idx_lag2 > 0){
    # mat[i,rd_col_idx_lag2] = ua[rd_row_idx_lag2,rd_col_idx_lag2]
    mat[i,rd_col_idx_lag2] = rd_row_idx_lag2
  }
  #if(length(rd_col_idx_lag)){
  mat[i,rd_col_idx_lag] = rd_row_idx_lag
  #}
  #if( length(rd_col_idx)){
  mat[i,rd_col_idx] = rd_row_idx
  #}
  }

indx = mat
vals = uadata
########################## I do this 300 times

x =
matrix(vals[cbind(c(indx),rep(1:ncol(indx),each=nrow(indx)))],nrow=nrow(indx),ncol=ncol(indx))

Regards,

ben

On Thu, Mar 8, 2012 at 11:40 AM, Rui Barradas <[hidden email]> wrote:

> Hello,
>
> > Humm.... If I understand what you are saying, you are correct. I get
> > 144.138 for 2009-03-20 for column C. Maybe I posted the wrong code?  If
> > so,
> > sorry.
>
> I think I have the fastest so far solution, and it checks with your
> corrected,last one.
>
> I've made just a change: to transform it into a function I renamed the
> parameters
> (only for use inside the function) 'zdates', without the period, 'rddata'
> and 'uadata'.
>
> 'fun1' is yours, 'fun2', mine. Here it goes.
>
>
> fun1 <- function(zdates, rddata, uadata){
>     fix = function(x)
>    {
>      year = substring(x, 1, 4)
>      mo = substring(x, 5, 6)
>      day = substring(x, 7, 8)
>      ifelse(year=="--", "--", paste(year, mo, day, sep = "-"))
>
>    }
>     rd = apply(rddata, 2, fix)
>    dimnames(rd) = dimnames(rd)
>
>    wd1 <- seq(from =as.Date(min(zdates)), to = Sys.Date(), by = "day")
>     #wd1 = wd1[weekdays(wd1) == "Friday"] # uncomment to go weekly
>    wd = sapply(wd1, as.character)
>     mat = matrix(NA,nrow=length(wd),ncol=ncol(uadata))
>    rownames(mat) = wd
>    nms = as.Date(rownames(uadata))
>
>    for(i in 1:length(wd)){
>      d = as.Date(wd[i])
>      diff = abs(nms - d)
>      rd_row_idx = max(which(diff == min(diff)))
>      rd_col_idx = which(as.Date(rd[rd_row_idx,], format="%Y-%m-%d")  < d)
>      rd_col_idx_lag = which(as.Date(rd[rd_row_idx - 1,], format="%Y-%m-%d")
> < d)
>      rd_col_idx_lag2 = which(as.Date(rd[rd_row_idx - 2,],
> format="%Y-%m-%d")  < d)
>
>      if(length(rd_col_idx_lag2) && (rd_row_idx - 2) > 0){
>         mat[i,rd_col_idx_lag2] = uadata[rd_row_idx - 2,rd_col_idx_lag2]
>      }
>      if(length(rd_col_idx_lag)){
>        mat[i,rd_col_idx_lag] = uadata[rd_row_idx - 1,rd_col_idx_lag]
>      }
>      if( length(rd_col_idx)){
>        mat[i,rd_col_idx] = uadata[rd_row_idx,rd_col_idx]
>      }
>    }
>    colnames(mat)=colnames(uadata)
>    mat
> }
>
> fun2 <- function(zdates, rddata, uadata){
>
>        fdate <- function(x, format="%Y%m%d"){
>                DF <- data.frame(x)
>                for(i in colnames(DF)){
>                        DF[, i] <- as.Date(DF[, i], format=format)
>                        class(DF[, i]) <- "Date"
>                }
>                DF
>        }
>
>         rddata <- fdate(rddata)
>        wd1 <- seq(from = as.Date(zdates[1]), to = Sys.Date(), by = "day")
>        nwd1 <- length(wd1)
>
>        fin1 <- matrix(NA, nrow=length(wd1), ncol=ncol(uadata))
>        nr <- nrow(rddata)
>        xstart <- c(integer(nr), nwd1)
>        for(j in 1:ncol(uadata)){
>                x <- xstart
>                for(i in 1:nr)
>                        x[i] <- if(!is.na(rddata[i, j]) &
> !is.nan(rddata[i, j]))
>                                        which(wd1 == rddata[i, j])
>                                        else NA
>                ix <- which(!is.na(x))
>                for(i in seq_len(length(ix) - 1)){
>                        from <- x[ ix[i] ] + 1
>                        to   <- x[ ix[i + 1] ]
>                        fin1[ from:to, j ] <- uadata[ ix[i], j ]
>                }
>        }
>        colnames(fin1) <- colnames(uadata)
>        rownames(fin1) <- as.character(wd1)
>        fin1
> }
>
> t1 <- system.time(m1 <- fun1(z.dates, rd1, ua))
> t2 <- system.time(m2 <- fun2(z.dates, rd1, ua))
>
> all.equal(m1, m2)
> [1] TRUE
>
> rbind(t1, t2)
>   user.self sys.self elapsed user.child sys.child
> t1      1.50        0    1.50         NA        NA
> t2      0.02        0    0.01         NA        NA
>
> And the better news is that I believe it scales up without degrading
> performance,
> like my first did.
>
> See if it works.
>
> Rui Barradas
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/index-instead-of-loop-tp4447672p4457290.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.