Quantcast

lagging over consecutive pairs of rows in dataframe

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

lagging over consecutive pairs of rows in dataframe

Evan Cooch
Suppose I have a dataframe that looks like the following:

n=2
mydata <- data.frame(exp = rep(1:5,each=n), rslt =
c(12,15,7,8,24,28,33,15,22,11))
mydata
    exp rslt
1    1   12
2    1   15
3    2    7
4    2    8
5    3   24
6    3   28
7    4   33
8    4   15
9    5   22
10   5   11

The variable 'exp' (for experiment') occurs in pairs over consecutive
rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
the 'control', and the second is a 'treatment'. The rslt column is the
result.

What I'm trying to do is create a subset of this dataframe that consists
of the exp number, and the lagged difference between the 'control' and
'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
take mydata (above), and turn it into

      exp  diff
1   1      3
2   2      1
3   3      4
4   4      -18
5   5      -11

The basic 'trick' I can't figure out is how to create a lagged variable
between the second row (record) for a given level of exp, and the first
row for that exp.  This is easy to do in SAS (which I'm more familiar
with), but I'm struggling with the equivalent in R. The brute force
approach  I thought of is to simply split the dataframe into to (one
even rows, one odd rows), merge by exp, and then calculate a difference.
But this seems to require renaming the rslt column in the two new
dataframes so they are different in the merge (say, rslt_cont n the odd
dataframe, and rslt_trt in the even dataframe), allowing me to calculate
a difference between the two.

While I suppose this would work, I'm wondering if I'm missing a more
elegant 'in place' approach that doesn't require me to split the data
frame and do every via a merge.

Suggestions/pointers to the obvious welcome. I've tried playing with
lag, and some approaches using lag in the zoo package,  but haven't
found the magic trick. The problem (meaning, what I can't figure out)
seems to be conditioning the lag on the level of exp.

Many thanks...


mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44))



        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Ulrik Stervbo-2
Hi Evan

you can easily do this by applying diff() to each exp group.

Either using dplyr:
library(dplyr)
mydata %>%
  group_by(exp) %>%
  summarise(difference = diff(rslt))

Or with base R
aggregate(mydata, by = list(group = mydata$exp), FUN = diff)

HTH
Ulrik


On Fri, 17 Mar 2017 at 17:34 Evan Cooch <[hidden email]> wrote:

> Suppose I have a dataframe that looks like the following:
>
> n=2
> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
> c(12,15,7,8,24,28,33,15,22,11))
> mydata
>     exp rslt
> 1    1   12
> 2    1   15
> 3    2    7
> 4    2    8
> 5    3   24
> 6    3   28
> 7    4   33
> 8    4   15
> 9    5   22
> 10   5   11
>
> The variable 'exp' (for experiment') occurs in pairs over consecutive
> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
> the 'control', and the second is a 'treatment'. The rslt column is the
> result.
>
> What I'm trying to do is create a subset of this dataframe that consists
> of the exp number, and the lagged difference between the 'control' and
> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
> take mydata (above), and turn it into
>
>       exp  diff
> 1   1      3
> 2   2      1
> 3   3      4
> 4   4      -18
> 5   5      -11
>
> The basic 'trick' I can't figure out is how to create a lagged variable
> between the second row (record) for a given level of exp, and the first
> row for that exp.  This is easy to do in SAS (which I'm more familiar
> with), but I'm struggling with the equivalent in R. The brute force
> approach  I thought of is to simply split the dataframe into to (one
> even rows, one odd rows), merge by exp, and then calculate a difference.
> But this seems to require renaming the rslt column in the two new
> dataframes so they are different in the merge (say, rslt_cont n the odd
> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
> a difference between the two.
>
> While I suppose this would work, I'm wondering if I'm missing a more
> elegant 'in place' approach that doesn't require me to split the data
> frame and do every via a merge.
>
> Suggestions/pointers to the obvious welcome. I've tried playing with
> lag, and some approaches using lag in the zoo package,  but haven't
> found the magic trick. The problem (meaning, what I can't figure out)
> seems to be conditioning the lag on the level of exp.
>
> Many thanks...
>
>
> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y =
> c(6,17,26,37,44))
>
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Evan Cooch


On 3/17/2017 12:58 PM, Ulrik Stervbo wrote:

> Hi Evan
>
> you can easily do this by applying diff() to each exp group.
>
> Either using dplyr:
> library(dplyr)
> mydata %>%
>   group_by(exp) %>%
>   summarise(difference = diff(rslt))
>
> Or with base R
> aggregate(mydata, by = list(group = mydata$exp), FUN = diff)
>
>


Indeed -- thanks very much!

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Bert Gunter-2
In reply to this post by Ulrik Stervbo-2
Evan:

You misunderstand the concept of a lagged variable.

Ulrik:

Well, yes, that is certainly a general solution that works. However,
given the *specific* structure described by the OP, an even more
direct (maybe more efficient?) way to do it just uses (logical)
subscripting:

odds <-  (seq_len(nrow(mydata)) %% 2) == 1
newdat <-data.frame(mydata[odds,1 ],mydata[!odds,2] - mydata[odds,2])
names(newdat) <- names(mydata)

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Mar 17, 2017 at 9:58 AM, Ulrik Stervbo <[hidden email]> wrote:

> Hi Evan
>
> you can easily do this by applying diff() to each exp group.
>
> Either using dplyr:
> library(dplyr)
> mydata %>%
>   group_by(exp) %>%
>   summarise(difference = diff(rslt))
>
> Or with base R
> aggregate(mydata, by = list(group = mydata$exp), FUN = diff)
>
> HTH
> Ulrik
>
>
> On Fri, 17 Mar 2017 at 17:34 Evan Cooch <[hidden email]> wrote:
>
>> Suppose I have a dataframe that looks like the following:
>>
>> n=2
>> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
>> c(12,15,7,8,24,28,33,15,22,11))
>> mydata
>>     exp rslt
>> 1    1   12
>> 2    1   15
>> 3    2    7
>> 4    2    8
>> 5    3   24
>> 6    3   28
>> 7    4   33
>> 8    4   15
>> 9    5   22
>> 10   5   11
>>
>> The variable 'exp' (for experiment') occurs in pairs over consecutive
>> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
>> the 'control', and the second is a 'treatment'. The rslt column is the
>> result.
>>
>> What I'm trying to do is create a subset of this dataframe that consists
>> of the exp number, and the lagged difference between the 'control' and
>> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
>> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
>> take mydata (above), and turn it into
>>
>>       exp  diff
>> 1   1      3
>> 2   2      1
>> 3   3      4
>> 4   4      -18
>> 5   5      -11
>>
>> The basic 'trick' I can't figure out is how to create a lagged variable
>> between the second row (record) for a given level of exp, and the first
>> row for that exp.  This is easy to do in SAS (which I'm more familiar
>> with), but I'm struggling with the equivalent in R. The brute force
>> approach  I thought of is to simply split the dataframe into to (one
>> even rows, one odd rows), merge by exp, and then calculate a difference.
>> But this seems to require renaming the rslt column in the two new
>> dataframes so they are different in the merge (say, rslt_cont n the odd
>> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
>> a difference between the two.
>>
>> While I suppose this would work, I'm wondering if I'm missing a more
>> elegant 'in place' approach that doesn't require me to split the data
>> frame and do every via a merge.
>>
>> Suggestions/pointers to the obvious welcome. I've tried playing with
>> lag, and some approaches using lag in the zoo package,  but haven't
>> found the magic trick. The problem (meaning, what I can't figure out)
>> seems to be conditioning the lag on the level of exp.
>>
>> Many thanks...
>>
>>
>> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y =
>> c(6,17,26,37,44))
>>
>>
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Evan Cooch


On 3/17/2017 1:19 PM, Bert Gunter wrote:
> Evan:
>
> You misunderstand the concept of a lagged variable.

Well, lag in R, perhaps (and by my own admission). In SAS, thats exactly
how it works.:

data test;
input exp rslt;
cards;
<data in the data frame in OP>
     *;


     data test2; set test; by exp;
     diff=rslt-lag(rslt);
       if last.exp;

>
> Ulrik:
>
> Well, yes, that is certainly a general solution that works. However,
> given the *specific* structure described by the OP, an even more
> direct (maybe more efficient?) way to do it just uses (logical)
> subscripting:
>
> odds <-  (seq_len(nrow(mydata)) %% 2) == 1
> newdat <-data.frame(mydata[odds,1 ],mydata[!odds,2] - mydata[odds,2])
> names(newdat) <- names(mydata)
>

Interesting - thanks!

>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Mar 17, 2017 at 9:58 AM, Ulrik Stervbo <[hidden email]> wrote:
>> Hi Evan
>>
>> you can easily do this by applying diff() to each exp group.
>>
>> Either using dplyr:
>> library(dplyr)
>> mydata %>%
>>    group_by(exp) %>%
>>    summarise(difference = diff(rslt))
>>
>> Or with base R
>> aggregate(mydata, by = list(group = mydata$exp), FUN = diff)
>>
>> HTH
>> Ulrik
>>
>>
>> On Fri, 17 Mar 2017 at 17:34 Evan Cooch <[hidden email]> wrote:
>>
>>> Suppose I have a dataframe that looks like the following:
>>>
>>> n=2
>>> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
>>> c(12,15,7,8,24,28,33,15,22,11))
>>> mydata
>>>      exp rslt
>>> 1    1   12
>>> 2    1   15
>>> 3    2    7
>>> 4    2    8
>>> 5    3   24
>>> 6    3   28
>>> 7    4   33
>>> 8    4   15
>>> 9    5   22
>>> 10   5   11
>>>
>>> The variable 'exp' (for experiment') occurs in pairs over consecutive
>>> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
>>> the 'control', and the second is a 'treatment'. The rslt column is the
>>> result.
>>>
>>> What I'm trying to do is create a subset of this dataframe that consists
>>> of the exp number, and the lagged difference between the 'control' and
>>> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
>>> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
>>> take mydata (above), and turn it into
>>>
>>>        exp  diff
>>> 1   1      3
>>> 2   2      1
>>> 3   3      4
>>> 4   4      -18
>>> 5   5      -11
>>>
>>> The basic 'trick' I can't figure out is how to create a lagged variable
>>> between the second row (record) for a given level of exp, and the first
>>> row for that exp.  This is easy to do in SAS (which I'm more familiar
>>> with), but I'm struggling with the equivalent in R. The brute force
>>> approach  I thought of is to simply split the dataframe into to (one
>>> even rows, one odd rows), merge by exp, and then calculate a difference.
>>> But this seems to require renaming the rslt column in the two new
>>> dataframes so they are different in the merge (say, rslt_cont n the odd
>>> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
>>> a difference between the two.
>>>
>>> While I suppose this would work, I'm wondering if I'm missing a more
>>> elegant 'in place' approach that doesn't require me to split the data
>>> frame and do every via a merge.
>>>
>>> Suggestions/pointers to the obvious welcome. I've tried playing with
>>> lag, and some approaches using lag in the zoo package,  but haven't
>>> found the magic trick. The problem (meaning, what I can't figure out)
>>> seems to be conditioning the lag on the level of exp.
>>>
>>> Many thanks...
>>>
>>>
>>> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y =
>>> c(6,17,26,37,44))
>>>
>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>          [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Bert Gunter-2
Evan:

Yes, I stand partially corrected. You have the concept correct, but R
implements it differently than SAS.

I think what you want for your approach is diff():

evens <-  (seq_len(nrow(mydata)) %% 2) == 0
newdat <-data.frame(exp=mydata[evens,1 ],reslt= diff(mydata[,2])[evens[-1]])

... which seems neater to me than what I offered previously.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Mar 17, 2017 at 10:25 AM, Evan Cooch <[hidden email]> wrote:

>
>
> On 3/17/2017 1:19 PM, Bert Gunter wrote:
>>
>> Evan:
>>
>> You misunderstand the concept of a lagged variable.
>
>
> Well, lag in R, perhaps (and by my own admission). In SAS, thats exactly how
> it works.:
>
> data test;
> input exp rslt;
> cards;
> <data in the data frame in OP>
>     *;
>
>
>     data test2; set test; by exp;
>     diff=rslt-lag(rslt);
>       if last.exp;
>
>>
>> Ulrik:
>>
>> Well, yes, that is certainly a general solution that works. However,
>> given the *specific* structure described by the OP, an even more
>> direct (maybe more efficient?) way to do it just uses (logical)
>> subscripting:
>>
>> odds <-  (seq_len(nrow(mydata)) %% 2) == 1
>> newdat <-data.frame(mydata[odds,1 ],mydata[!odds,2] - mydata[odds,2])
>> names(newdat) <- names(mydata)
>>
>
> Interesting - thanks!
>
>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Fri, Mar 17, 2017 at 9:58 AM, Ulrik Stervbo <[hidden email]>
>> wrote:
>>>
>>> Hi Evan
>>>
>>> you can easily do this by applying diff() to each exp group.
>>>
>>> Either using dplyr:
>>> library(dplyr)
>>> mydata %>%
>>>    group_by(exp) %>%
>>>    summarise(difference = diff(rslt))
>>>
>>> Or with base R
>>> aggregate(mydata, by = list(group = mydata$exp), FUN = diff)
>>>
>>> HTH
>>> Ulrik
>>>
>>>
>>> On Fri, 17 Mar 2017 at 17:34 Evan Cooch <[hidden email]> wrote:
>>>
>>>> Suppose I have a dataframe that looks like the following:
>>>>
>>>> n=2
>>>> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
>>>> c(12,15,7,8,24,28,33,15,22,11))
>>>> mydata
>>>>      exp rslt
>>>> 1    1   12
>>>> 2    1   15
>>>> 3    2    7
>>>> 4    2    8
>>>> 5    3   24
>>>> 6    3   28
>>>> 7    4   33
>>>> 8    4   15
>>>> 9    5   22
>>>> 10   5   11
>>>>
>>>> The variable 'exp' (for experiment') occurs in pairs over consecutive
>>>> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
>>>> the 'control', and the second is a 'treatment'. The rslt column is the
>>>> result.
>>>>
>>>> What I'm trying to do is create a subset of this dataframe that consists
>>>> of the exp number, and the lagged difference between the 'control' and
>>>> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
>>>> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
>>>> take mydata (above), and turn it into
>>>>
>>>>        exp  diff
>>>> 1   1      3
>>>> 2   2      1
>>>> 3   3      4
>>>> 4   4      -18
>>>> 5   5      -11
>>>>
>>>> The basic 'trick' I can't figure out is how to create a lagged variable
>>>> between the second row (record) for a given level of exp, and the first
>>>> row for that exp.  This is easy to do in SAS (which I'm more familiar
>>>> with), but I'm struggling with the equivalent in R. The brute force
>>>> approach  I thought of is to simply split the dataframe into to (one
>>>> even rows, one odd rows), merge by exp, and then calculate a difference.
>>>> But this seems to require renaming the rslt column in the two new
>>>> dataframes so they are different in the merge (say, rslt_cont n the odd
>>>> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
>>>> a difference between the two.
>>>>
>>>> While I suppose this would work, I'm wondering if I'm missing a more
>>>> elegant 'in place' approach that doesn't require me to split the data
>>>> frame and do every via a merge.
>>>>
>>>> Suggestions/pointers to the obvious welcome. I've tried playing with
>>>> lag, and some approaches using lag in the zoo package,  but haven't
>>>> found the magic trick. The problem (meaning, what I can't figure out)
>>>> seems to be conditioning the lag on the level of exp.
>>>>
>>>> Many thanks...
>>>>
>>>>
>>>> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y =
>>>> c(6,17,26,37,44))
>>>>
>>>>
>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Evan Cooch
Thanks very much. I suspect 50% of my time in R is spent translating
from what I know how to do in SAS (25+ years of heavy use), to what is
equivalent in SAS. So far, I haven't found anything I can do in SAS that
I can't do in R, with some help. ;-)

Cheers...

On 3/17/2017 1:51 PM, Bert Gunter wrote:

> Evan:
>
> Yes, I stand partially corrected. You have the concept correct, but R
> implements it differently than SAS.
>
> I think what you want for your approach is diff():
>
> evens <-  (seq_len(nrow(mydata)) %% 2) == 0
> newdat <-data.frame(exp=mydata[evens,1 ],reslt= diff(mydata[,2])[evens[-1]])
>
> ... which seems neater to me than what I offered previously.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Mar 17, 2017 at 10:25 AM, Evan Cooch <[hidden email]> wrote:
>>
>> On 3/17/2017 1:19 PM, Bert Gunter wrote:
>>> Evan:
>>>
>>> You misunderstand the concept of a lagged variable.
>>
>> Well, lag in R, perhaps (and by my own admission). In SAS, thats exactly how
>> it works.:
>>
>> data test;
>> input exp rslt;
>> cards;
>> <data in the data frame in OP>
>>      *;
>>
>>
>>      data test2; set test; by exp;
>>      diff=rslt-lag(rslt);
>>        if last.exp;
>>
>>> Ulrik:
>>>
>>> Well, yes, that is certainly a general solution that works. However,
>>> given the *specific* structure described by the OP, an even more
>>> direct (maybe more efficient?) way to do it just uses (logical)
>>> subscripting:
>>>
>>> odds <-  (seq_len(nrow(mydata)) %% 2) == 1
>>> newdat <-data.frame(mydata[odds,1 ],mydata[!odds,2] - mydata[odds,2])
>>> names(newdat) <- names(mydata)
>>>
>> Interesting - thanks!
>>
>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Fri, Mar 17, 2017 at 9:58 AM, Ulrik Stervbo <[hidden email]>
>>> wrote:
>>>> Hi Evan
>>>>
>>>> you can easily do this by applying diff() to each exp group.
>>>>
>>>> Either using dplyr:
>>>> library(dplyr)
>>>> mydata %>%
>>>>     group_by(exp) %>%
>>>>     summarise(difference = diff(rslt))
>>>>
>>>> Or with base R
>>>> aggregate(mydata, by = list(group = mydata$exp), FUN = diff)
>>>>
>>>> HTH
>>>> Ulrik
>>>>
>>>>
>>>> On Fri, 17 Mar 2017 at 17:34 Evan Cooch <[hidden email]> wrote:
>>>>
>>>>> Suppose I have a dataframe that looks like the following:
>>>>>
>>>>> n=2
>>>>> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
>>>>> c(12,15,7,8,24,28,33,15,22,11))
>>>>> mydata
>>>>>       exp rslt
>>>>> 1    1   12
>>>>> 2    1   15
>>>>> 3    2    7
>>>>> 4    2    8
>>>>> 5    3   24
>>>>> 6    3   28
>>>>> 7    4   33
>>>>> 8    4   15
>>>>> 9    5   22
>>>>> 10   5   11
>>>>>
>>>>> The variable 'exp' (for experiment') occurs in pairs over consecutive
>>>>> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
>>>>> the 'control', and the second is a 'treatment'. The rslt column is the
>>>>> result.
>>>>>
>>>>> What I'm trying to do is create a subset of this dataframe that consists
>>>>> of the exp number, and the lagged difference between the 'control' and
>>>>> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
>>>>> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
>>>>> take mydata (above), and turn it into
>>>>>
>>>>>         exp  diff
>>>>> 1   1      3
>>>>> 2   2      1
>>>>> 3   3      4
>>>>> 4   4      -18
>>>>> 5   5      -11
>>>>>
>>>>> The basic 'trick' I can't figure out is how to create a lagged variable
>>>>> between the second row (record) for a given level of exp, and the first
>>>>> row for that exp.  This is easy to do in SAS (which I'm more familiar
>>>>> with), but I'm struggling with the equivalent in R. The brute force
>>>>> approach  I thought of is to simply split the dataframe into to (one
>>>>> even rows, one odd rows), merge by exp, and then calculate a difference.
>>>>> But this seems to require renaming the rslt column in the two new
>>>>> dataframes so they are different in the merge (say, rslt_cont n the odd
>>>>> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
>>>>> a difference between the two.
>>>>>
>>>>> While I suppose this would work, I'm wondering if I'm missing a more
>>>>> elegant 'in place' approach that doesn't require me to split the data
>>>>> frame and do every via a merge.
>>>>>
>>>>> Suggestions/pointers to the obvious welcome. I've tried playing with
>>>>> lag, and some approaches using lag in the zoo package,  but haven't
>>>>> found the magic trick. The problem (meaning, what I can't figure out)
>>>>> seems to be conditioning the lag on the level of exp.
>>>>>
>>>>> Many thanks...
>>>>>
>>>>>
>>>>> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y =
>>>>> c(6,17,26,37,44))
>>>>>
>>>>>
>>>>>
>>>>>           [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>           [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: lagging over consecutive pairs of rows in dataframe

Mathew Guilfoyle
In reply to this post by Evan Cooch
If you are strict about your data formatting then the following is a fast way of calculating the differences, based on reshaping the data column:

A = matrix(mydata$rslt, nrow=2)
data.frame(exp=1:ncol(A), diff=A[2,]-A[1,])

alternatively, if the 'exp' values are not guaranteed to be sequential you can reshape an index:

A = matrix(1:nrow(mydata), nrow=2)
data.frame(exp=mydata$exp[A[1,]], diff=mydata$rslt[A[2,]]-mydata$rslt[A[1,]])

However, I would suggest that you have a further variable to label 'control' and 'treatment' groups and explicitly use this for the calculation.  Otherwise, if at any time you sort or reorder the data you will run into problems or produce erroneous results (but more than likely won't generate any actual R errors to alert you):

data.frame(exp = unique(mydata$exp), diff = as.vector(by(mydata, mydata$exp, function(x) x$rslt[x$type=='treatment']-x$rslt[x$type=='control'])))

The efficiency of the various options that have been suggested in this thread piqued my interest so a quick benchmark seemed in order (see below, including a 'safer' method).  Of course, this is probably only relevant if you have huge datasets that you are repeatedly performing this calculation on.


library('microbenchmark')

#create some example data similar to the OP
ndata = 1000
mydata = data.frame(exp = cumsum(rep(c(1,0),ndata)),rslt=sample(1:50, size = ndata*2, replace = TRUE), type=rep(c('control','treatment'),ndata))

#various suggested options
diff.BG = function(mydata) {
  evens <-  (seq_len(nrow(mydata)) %% 2) == 0
  data.frame(exp = mydata[evens,1 ], diff = diff(mydata[,2])[evens[-1]])
}

diff.US = function(mydata) {
  aggregate(mydata$rslt, by = list(group = mydata$exp), FUN = diff)
}

diff.MG1 = function(mydata) {
  A = matrix(mydata$rslt, nrow=2)
  data.frame(exp=1:ncol(A), diff=A[2,]-A[1,])
}

diff.MG2 = function(mydata) {
  A = matrix(1:nrow(mydata), nrow=2)
  data.frame(exp=mydata$exp[A[1,]], diff=mydata$rslt[A[2,]]-mydata$rslt[A[1,]])
}

diff.safe = function(mydata) {
  data.frame(exp = unique(mydata$exp), diff = as.vector(by(mydata, mydata$exp, function(x) x$rslt[x$type=='treatment']-x$rslt[x$type=='control'])))
}

#benchmark
microbenchmark(BG=diff.BG(mydata), US=diff.US(mydata), MG1=diff.MG1(mydata), MG2=diff.MG2(mydata), safe=diff.safe(my data))

Unit: microseconds
 expr       min          lq        mean      median          uq        max neval
   BG   273.837    299.0015    351.0377    316.7400    350.5220   2385.289   100
   US  9872.457  10511.1065  11555.6048  11108.0790  12471.8060  17609.518   100
  MG1   168.783    196.8635    229.9329    210.9370    249.4895    471.381   100
  MG2   221.303    237.0480    265.5097    254.3895    280.7815    418.728   100
 safe 97869.540 104164.5130 109579.9834 107199.7715 110315.8590 170028.377   100



> On 17 Mar 2017, at 14:54, Evan Cooch <[hidden email]> wrote:
>
> Suppose I have a dataframe that looks like the following:
>
> n=2
> mydata <- data.frame(exp = rep(1:5,each=n), rslt =
> c(12,15,7,8,24,28,33,15,22,11))
> mydata
>    exp rslt
> 1    1   12
> 2    1   15
> 3    2    7
> 4    2    8
> 5    3   24
> 6    3   28
> 7    4   33
> 8    4   15
> 9    5   22
> 10   5   11
>
> The variable 'exp' (for experiment') occurs in pairs over consecutive
> rows -- 1,1, then 2,2, then 3,3, and so on. The first row in a pair is
> the 'control', and the second is a 'treatment'. The rslt column is the
> result.
>
> What I'm trying to do is create a subset of this dataframe that consists
> of the exp number, and the lagged difference between the 'control' and
> 'treatment' result.  So, for exp=1, the difference is (15-12)=3. For
> exp=2,  the difference is (8-7)=1, and so on. What I'm hoping to do is
> take mydata (above), and turn it into
>
>      exp  diff
> 1   1      3
> 2   2      1
> 3   3      4
> 4   4      -18
> 5   5      -11
>
> The basic 'trick' I can't figure out is how to create a lagged variable
> between the second row (record) for a given level of exp, and the first
> row for that exp.  This is easy to do in SAS (which I'm more familiar
> with), but I'm struggling with the equivalent in R. The brute force
> approach  I thought of is to simply split the dataframe into to (one
> even rows, one odd rows), merge by exp, and then calculate a difference.
> But this seems to require renaming the rslt column in the two new
> dataframes so they are different in the merge (say, rslt_cont n the odd
> dataframe, and rslt_trt in the even dataframe), allowing me to calculate
> a difference between the two.
>
> While I suppose this would work, I'm wondering if I'm missing a more
> elegant 'in place' approach that doesn't require me to split the data
> frame and do every via a merge.
>
> Suggestions/pointers to the obvious welcome. I've tried playing with
> lag, and some approaches using lag in the zoo package,  but haven't
> found the magic trick. The problem (meaning, what I can't figure out)
> seems to be conditioning the lag on the level of exp.
>
> Many thanks...
>
>
> mydata <-*data.frame*(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44))
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...