Conditionally adding a constant


vioravis
I am trying to add a constant to the previous value of a variable based on certain conditions. Maybe there is a simple way to do this that I am missing completely. I have given an example below:

df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))

> df
  x  y
1 1 10
2 2 20
3 3 30
4 4 NA
5 5 NA

Whenever x exceeds 3, I want y to be the previous value of y plus 2 (the NAs will also have to be handled in the process). The resulting output would look like:

  x  y
1 1 10
2 2 20
3 3 30
4 4 32
5 5 34

Can someone please explain how to do it? Thank you.

Ravi

Re: Conditionally adding a constant

Rui Barradas
Hello,

I believe this works.

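# f1: if the previous element exceeds 3, replace the current one with previous + 2
f1 <- function(x){
        for(i in seq_along(x)[-1]) x[i] <- ifelse(x[i-1] > 3, x[i-1] + 2, x[i])
        x
}

# f2: same test, but only elements that are NA are replaced
f2 <- function(x){
        for(i in seq_along(x)[-1]) x[i] <- ifelse(is.na(x[i]) & (x[i-1] > 3), x[i-1] + 2, x[i])
        x
}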

df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))

apply(df, 2, f1)      # df$x[4] > 3, df$x[5] also changes
apply(df, 2, f2)      # only df$y has NA's

Maybe there's a better way, avoiding the loop.

Rui Barradas

Re: Conditionally adding a constant

Joshua Wiley-2
Here is another approach.  With some thought and fingerwork, rle()
could probably be used to avoid the while loop, but the loop should
only slow things down if there are long runs of NAs.  There can be a
lot of NAs, and as long as they are spaced apart it should still be
quite efficient.

f <- function(x, y) {
  i <- which(x > 3)
  cond <- TRUE
  while (cond) {
    y[i] <- y[i - 1] + 2L
    cond <- any(is.na(y))
  }
  return(y)
}

df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))

df$y <- f(df$x, df$y)
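
The rle() idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name fill_na_runs and its arguments limit and step are assumptions. It finds each run of positions where x > limit and y is NA, and fills the whole run at once by counting up from the last known value before the run.

```r
# Hypothetical helper (not from the thread): fill each run of NAs where
# x > limit by counting up from the last known value in steps of `step`.
fill_na_runs <- function(x, y, limit = 3, step = 2) {
  target <- !is.na(x) & x > limit & is.na(y)   # positions to fill
  runs   <- rle(target)
  ends   <- cumsum(runs$lengths)               # last index of each run
  starts <- ends - runs$lengths + 1            # first index of each run
  for (k in which(runs$values)) {
    s <- starts[k]
    if (s == 1) next                           # no previous value to build on
    idx <- s:ends[k]
    y[idx] <- y[s - 1] + step * seq_along(idx)
  }
  y
}

df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))
fill_na_runs(df$x, df$y)
# [1] 10 20 30 32 34
```

The loop here runs once per run of NAs rather than once per NA, so long runs cost no extra iterations. A run that starts at the first position, or that follows an NA which was not filled, stays NA.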

Cheers,

Josh

On Mon, Jan 2, 2012 at 4:47 AM, Rui Barradas <[hidden email]> wrote:

> Hello,
>
> I believe this works.

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Conditionally adding a constant

jholtman
Here is a way of doing it without loops:

> df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))
>
> require(zoo)  # need na.locf to fix the NAs
>
> # replace NA with preceding values
> df$y <- na.locf(df$y)
> df
  x  y
1 1 10
2 2 20
3 3 30
4 4 30
5 5 30
>
> # assuming that you want to increment the counts when x > 3
> inc <- cumsum(df$x > 3) * 2
> inc
[1] 0 0 0 2 4
>
> df$y <- df$y + inc
> df
  x  y
1 1 10
2 2 20
3 3 30
4 4 32
5 5 34

On Mon, Jan 2, 2012 at 1:59 PM, Joshua Wiley <[hidden email]> wrote:

> Here is another approach.  Probably with some thought and fingerwork,
> rle() could be used to avoid the while loop, but that should only slow
> things down if there are long runs of NAs --- there can be a lot of
> NAs as long as they are spaced apart and it should still be quite
> efficient.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

Re: Conditionally adding a constant

Rui Barradas
In reply to this post by vioravis
Hello again,

I believe we are all missing something. Isn't it possible for the first values of 'y' to be NA?
And isn't it also possible to have x[1] > 3?

Here is my point (I have changed function 'f2' to allow for such cases; 'f1' is rubbish).

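# Rui
f3 <- function(x, y){
        # positions where x > 3 and y is NA; position 1 has no previous value
        idx <- setdiff(which(x > 3 & is.na(y)), 1)
        # idx is increasing, so consecutive NAs build on the value just filled in
        for(i in idx) y[i] <- y[i-1] + 2L
        y
}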

# Jim's, as a function, 'na.rm' option added or else 'df3' would produce an error
require(zoo)
f4 <- function(x, y){
        y <- na.locf(y, na.rm=FALSE)
        inc <- cumsum(x > 3) * 2
        y + inc
}

df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))
df
df2 <- data.frame(x = c(1,2,3,4,5), y = c(10,20,NA,40,NA))
df2
df3 <- data.frame(x = c(1,2,3,4,5), y = rev(c(10,20,30,NA,NA)))
df3

# Joshua
f(df$x, df$y)      # works
f(df2$x, df2$y)    # infinite loop
f(df3$x, df3$y)    # infinite loop

# Rui
f3(df$x, df$y)     # works
f3(df2$x, df2$y)   # works as expected?
f3(df3$x, df3$y)   # works as expected?

# Jim
f4(df$x, df$y)     # works
f4(df2$x, df2$y)   # works as expected?
f4(df3$x, df3$y)   # works as expected?
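
For the two calls flagged as infinite loops above, one possible guard is to stop as soon as a pass fills nothing. This is a minimal sketch, not Joshua's code: the name f_guarded and the step argument are assumptions. It only touches positions that are NA, have x > 3, and have a known previous value.

```r
# Hypothetical variant of the while-loop approach that always terminates:
# every pass fills at least one NA, or the loop stops.
f_guarded <- function(x, y, step = 2) {
  repeat {
    i <- which(x > 3 & is.na(y))   # candidates: x > 3 and y missing
    i <- i[i > 1]                  # the first position has no predecessor
    i <- i[!is.na(y[i - 1])]       # keep those whose previous value is known
    if (length(i) == 0) break      # nothing left that can be filled
    y[i] <- y[i - 1] + step
  }
  y
}

f_guarded(c(1,2,3,4,5), c(10,20,30,NA,NA))
# [1] 10 20 30 32 34
f_guarded(c(1,2,3,4,5), c(10,20,NA,40,NA))
# [1] 10 20 NA 40 42
f_guarded(c(1,2,3,4,5), rev(c(10,20,30,NA,NA)))
# [1] NA NA 30 20 10
```

NAs at positions where x <= 3 are left alone, which is one possible reading of the original question.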

If this makes sense, the performance tests are very much in favour of Jim's solution.


# If this is what is asked for, test the performance
# with large enough N
N <- 1.e5
dftest <- data.frame(x=1:N, y=c(sample(c(rep(NA, 5), 10*1:5), N, replace=TRUE)))

sum(is.na(dftest))/N    # proportion of NAs in 'dftest'

t2 <- system.time(invisible(apply(dftest, 2, f2)))[c(1, 3)]
t3 <- system.time(invisible(f3(dftest$x, dftest$y)))[c(1, 3)]
t4 <- system.time(invisible(f4(dftest$x, dftest$y)))[c(1, 3)]
rbind(t2=t2, t3=t3, t4=t4, t2.t3=t2/t3, t2.t4=t2/t4, t3.t4=t3/t4)

Sample output

      user.self   elapsed
t2      2.93000   2.95000
t3      0.22000   0.22000
t4      0.01000   0.01000
t2.t3  13.31818  13.40909
t2.t4 293.00000 295.00000
t3.t4  22.00000  22.00000

That is a factor of about 300 over the initial solution (f2), or 20+ over the other loop-based one (f3).
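
An elapsed time as small as 0.01 may be close to the timer's resolution, so the ratios are only rough. One way to steady them is to repeat each measurement and take the median. The helper below is a sketch for illustration; the name median_elapsed and its reps argument are assumptions, not part of the thread.

```r
# Hypothetical helper: median elapsed time of fun() over `reps` runs
median_elapsed <- function(fun, reps = 5) {
  times <- replicate(reps, system.time(fun())[["elapsed"]])
  median(times)
}

# Example: time the vectorised increment on a large vector
N <- 1e5
x <- 1:N
median_elapsed(function() cumsum(x > 3) * 2)
```

Each solution would be wrapped the same way, for example function() f3(dftest$x, dftest$y), and the medians compared instead of single runs.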

The downside is that it needs an extra package loaded, but 'zoo' is rather commonplace.
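
If loading a package is a concern, the same idea can be sketched in base R. The names locf_base and f4_base are made up for this illustration; locf_base is meant as a stand-in for na.locf(y, na.rm=FALSE) on a plain vector, not a full replacement for the zoo function.

```r
# Hypothetical base-R stand-in for zoo::na.locf(y, na.rm = FALSE)
locf_base <- function(y) {
  known <- !is.na(y)
  # cumsum(known) counts the non-NA values seen so far; 0 means "none yet",
  # which maps to the leading NA in the lookup vector
  c(NA, y[known])[cumsum(known) + 1]
}

# Same idea as f4, without zoo
f4_base <- function(x, y) {
  locf_base(y) + cumsum(x > 3) * 2
}

f4_base(c(1,2,3,4,5), c(10,20,30,NA,NA))
# [1] 10 20 30 32 34
f4_base(c(1,2,3,4,5), c(10,20,NA,40,NA))
# [1] 10 20 20 42 44
```

Note that, like f4, this adds the increment to every row with x > 3, including rows where y was not NA, and it fills NAs where x <= 3. That is why the second result differs from what f3 gives for the same data.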

Rui Barradas




Re: Conditionally adding a constant

Joshua Wiley-2
Good points, Rui.

On Mon, Jan 2, 2012 at 12:48 PM, Rui Barradas <[hidden email]> wrote:
> Hello again,
>
> I believe we are all missing something. Isn't it possible to have NAs as the
> first values of 'y'?
> And isn't it also possible to have x[1] > 3?

Theoretically, yes; in the OP's data, maybe?  If the data are a time
series (or time-series-like), the zoo package is not a bad environment
to be working in anyway.  There are all sorts of handy functions
(based on the OP's little example dataset, I had almost recommended
na.approx(), which replaces NAs with a linear interpolation).  I am
not sure whether the +2 thing is just an attempt at interpolation or
something more general.


Re: Conditionally adding a constant

jholtman
If you are worried about an NA in the first position, then use the following:

> y <- c(NA, 1, 2, NA, 4, NA)
> y <- na.locf(y, na.rm = FALSE)
> y
[1] NA  1  2  2  4  4
> y <- na.locf(y, fromLast = TRUE)
> y
[1] 1 1 2 2 4 4
>


On Mon, Jan 2, 2012 at 5:07 PM, Joshua Wiley <[hidden email]> wrote:

> Good points, Rui.