how to subset based on other row values and multiplicity

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

how to subset based on other row values and multiplicity

Williams Scott
Hi R experts,
 
I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
in an individual (Œid¹) if they occur
more than once at least 30 days apart.

 
id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w

 
I wish to subset the data to retain rows where the value for the
individual is confirmed more than 30 days apart. So, after deleting all
rows with just one occurrence of id and value, the rest would be the
earliest occurrence of each value in each case id, provided 31 or more
days exist between the dates. If >1 value is present per id, each value
level needs to be assessed independently. This example would then reduce
to:

 
id   date           value
a    2000-01-01 x
c    2000-09-10 y
c    2000-10-11 z

 
 
I can do this via some crude loops and subsetting, but I am looking for as
much efficiency as possible
as the dataset has around 50 million rows to assess. Any suggestions
welcomed.

Thanks in advance
 
Scott Williams MD
Melbourne, Australia



This email (including any attachments or links) may contain
confidential and/or legally privileged information and is
intended only to be read or used by the addressee.  If you
are not the intended addressee, any use, distribution,
disclosure or copying of this email is strictly
prohibited.
Confidentiality and legal privilege attached to this email
(including any attachments) are not waived or lost by
reason of its mistaken delivery to you.
If you have received this email in error, please delete it
and notify us immediately by telephone or email.  Peter
MacCallum Cancer Centre provides no guarantee that this
transmission is free of virus or that it has not been
intercepted or altered and will not be liable for any delay
in its receipt.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

John McKown
On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
<[hidden email]> wrote:

> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z

Question: the c-y id-value pair occurs 3 times. In two cases
(2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
is >30 days. Why isn't
c 2000-10-01 y
also part of the result? Is it because you only want a single id-value
pair in which the date is the minimal? Or you want the one in which
the date difference is maximal? Or you overlooked that particular
match? I can't figure it out from your description.

>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.

Hum, is the source of this data in a relational database such as
Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
because some of this processing might be easier do to in the data base
using a "self join", instead of reading the entire relational table
into a data.frame and doing it in R.

>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

jholtman
I can reproduce what you requested, but there was the question about
what happens with the multiple 'c-y' values.

====================

> require(data.table)
> x <- read.table(text = 'id   date value
+ a    2000-01-01 x
+ a    2000-03-01 x
+ b    2000-11-11 w
+ c    2000-11-11 y
+ c    2000-10-01 y
+ c    2000-09-10 y
+ c    2000-12-12 z
+ c    2000-10-11 z
+ d    2000-11-11 w
+ d    2000-11-10 w', as.is = TRUE, header = TRUE)
> setDT(x)
> x[, date := as.Date(date)]
> setkey(x, id, value, date)
>
> y <- x[
+     , {
+         if (.N == 1) val <- NULL  # only one -- delete
+         else {
+             dif <- difftime(tail(date, -1), head(date, -1), units = 'days')
+             # return first value if any > 31
+             if (any(dif >= 31)) val <- list(date = date[1L])
+             else val <- NULL
+         }
+         val
+       }
+     , keyby = 'id,value'
+     ]
> y
   id value       date
1:  a     x 2000-01-01
2:  c     y 2000-09-10
3:  c     z 2000-10-11

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Jul 16, 2014 at 9:25 AM, John McKown
<[hidden email]> wrote:

> On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
> <[hidden email]> wrote:
>> Hi R experts,
>>
>> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
>> in an individual (Œid¹) if they occur
>> more than once at least 30 days apart.
>>
>>
>> id   date value
>> a    2000-01-01 x
>> a    2000-03-01 x
>> b    2000-11-11 w
>> c    2000-11-11 y
>> c    2000-10-01 y
>> c    2000-09-10 y
>> c    2000-12-12 z
>> c    2000-10-11 z
>> d    2000-11-11 w
>> d    2000-11-10 w
>>
>>
>> I wish to subset the data to retain rows where the value for the
>> individual is confirmed more than 30 days apart. So, after deleting all
>> rows with just one occurrence of id and value, the rest would be the
>> earliest occurrence of each value in each case id, provided 31 or more
>> days exist between the dates. If >1 value is present per id, each value
>> level needs to be assessed independently. This example would then reduce
>> to:
>>
>>
>> id   date           value
>> a    2000-01-01 x
>> c    2000-09-10 y
>> c    2000-10-11 z
>
> Question: the c-y id-value pair occurs 3 times. In two cases
> (2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
> is >30 days. Why isn't
> c 2000-10-01 y
> also part of the result? Is it because you only want a single id-value
> pair in which the date is the minimal? Or you want the one in which
> the date difference is maximal? Or you overlooked that particular
> match? I can't figure it out from your description.
>
>>
>>
>>
>> I can do this via some crude loops and subsetting, but I am looking for as
>> much efficiency as possible
>> as the dataset has around 50 million rows to assess. Any suggestions
>> welcomed.
>
> Hum, is the source of this data in a relational database such as
> Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
> because some of this processing might be easier do to in the data base
> using a "self join", instead of reading the entire relational table
> into a data.frame and doing it in R.
>
>>
>> Thanks in advance
>>
>> Scott Williams MD
>> Melbourne, Australia
>>
>
> --
> There is nothing more pleasant than traveling and meeting new people!
> Genghis Khan
>
> Maranatha! <><
> John McKown
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

John McKown
In reply to this post by John McKown
Thanks. So you only want a single entry with a given "id" & "value",
even if there are multiple possible confirmations.

Too bad about not being in an SQL data base. I've already partially
solved the problem using PostgreSQL. Just in case you, or others,
might be interested, below is a transcript of what I have. The SQL
might suggest a possible approach in native R.

<transcript>
tsh009=# select * from datedata;
 id |    date    | value
----+------------+-------
 a  | 2000-01-01 | x
 a  | 2000-03-01 | x
 b  | 2000-11-11 | w
 c  | 2000-11-11 | y
 c  | 2000-10-01 | y
 c  | 2000-09-10 | y
 c  | 2000-12-12 | z
 c  | 2000-10-11 | z
 d  | 2000-11-11 | w
 d  | 2000-11-10 | w
(10 rows)

tsh009=# select a.id,a.date,a.value
from datedata as a
join datedata as b
on a.id = b.id and a.value=b.value
where b.date - a.date > 30;
 id |    date    | value
----+------------+-------
 a  | 2000-01-01 | x
 c  | 2000-10-01 | y
 c  | 2000-09-10 | y
 c  | 2000-10-11 | z
(4 rows)

</transcript>

the only problem is the "multiple confirmation" problem because you
only want / need a single c-y confirmation and my code produces all
possible ones.

On Wed, Jul 16, 2014 at 8:38 AM, Williams Scott
<[hidden email]> wrote:

> It probably isn’t that clear John - to put it another way - each patient
> (‘id’) can have multiple diagnosis codes (w -> z in this example, several
> thousand in reality) recorded at multiple times. I just need to find the
> ‘confirmed’ diagnosis code or codes for each patient. To be confirmed they
> have to occur at least twice and at least a month apart. So patient c has
> 2 diagnoses recorded, 1 recorded twice and one thrice; each confirmed by
> multiplicity and time.
>
> The data is delivered as a flat .txt file. I’m not proficient with any
> databases other than MS Access unfortunately, and the 120Gb of data is not
> easily managed in Access.
>
> I hope that helps
> S
>
> On 16/07/2014 11:25 pm, "John McKown" <[hidden email]> wrote:
>
>>On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
>><[hidden email]> wrote:
>>> Hi R experts,
>>>
>>> I have a dataset as sampled below. Values are only regarded as
>>>Œconfirmed¹
>>> in an individual (Œid¹) if they occur
>>> more than once at least 30 days apart.
>>>
>>>
>>> id   date value
>>> a    2000-01-01 x
>>> a    2000-03-01 x
>>> b    2000-11-11 w
>>> c    2000-11-11 y
>>> c    2000-10-01 y
>>> c    2000-09-10 y
>>> c    2000-12-12 z
>>> c    2000-10-11 z
>>> d    2000-11-11 w
>>> d    2000-11-10 w
>>>
>>>
>>> I wish to subset the data to retain rows where the value for the
>>> individual is confirmed more than 30 days apart. So, after deleting all
>>> rows with just one occurrence of id and value, the rest would be the
>>> earliest occurrence of each value in each case id, provided 31 or more
>>> days exist between the dates. If >1 value is present per id, each value
>>> level needs to be assessed independently. This example would then reduce
>>> to:
>>>
>>>
>>> id   date           value
>>> a    2000-01-01 x
>>> c    2000-09-10 y
>>> c    2000-10-11 z
>>
>>Question: the c-y id-value pair occurs 3 times. In two cases
>>(2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
>>is >30 days. Why isn't
>>c 2000-10-01 y
>>also part of the result? Is it because you only want a single id-value
>>pair in which the date is the minimal? Or you want the one in which
>>the date difference is maximal? Or you overlooked that particular
>>match? I can't figure it out from your description.
>>
>>>
>>>
>>>
>>> I can do this via some crude loops and subsetting, but I am looking for
>>>as
>>> much efficiency as possible
>>> as the dataset has around 50 million rows to assess. Any suggestions
>>> welcomed.
>>
>>Hum, is the source of this data in a relational database such as
>>Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
>>because some of this processing might be easier do to in the data base
>>using a "self join", instead of reading the entire relational table
>>into a data.frame and doing it in R.
>>
>>>
>>> Thanks in advance
>>>
>>> Scott Williams MD
>>> Melbourne, Australia
>>>
>>
>>--
>>There is nothing more pleasant than traveling and meeting new people!
>>Genghis Khan
>>
>>Maranatha! <><
>>John McKown
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
>



--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

John McKown
In reply to this post by jholtman
On Wed, Jul 16, 2014 at 8:51 AM, jim holtman <[hidden email]> wrote:

> I can reproduce what you requested, but there was the question about
> what happens with the multiple 'c-y' values.
>
> ====================
>
>> require(data.table)
>> x <- read.table(text = 'id   date value
> + a    2000-01-01 x
> + a    2000-03-01 x
> + b    2000-11-11 w
> + c    2000-11-11 y
> + c    2000-10-01 y
> + c    2000-09-10 y
> + c    2000-12-12 z
> + c    2000-10-11 z
> + d    2000-11-11 w
> + d    2000-11-10 w', as.is = TRUE, header = TRUE)
>> setDT(x)
>> x[, date := as.Date(date)]
>> setkey(x, id, value, date)
>>
>> y <- x[
> +     , {
> +         if (.N == 1) val <- NULL  # only one -- delete
> +         else {
> +             dif <- difftime(tail(date, -1), head(date, -1), units = 'days')
> +             # return first value if any > 31
> +             if (any(dif >= 31)) val <- list(date = date[1L])
> +             else val <- NULL
> +         }
> +         val
> +       }
> +     , keyby = 'id,value'
> +     ]
>> y
>    id value       date
> 1:  a     x 2000-01-01
> 2:  c     y 2000-09-10
> 3:  c     z 2000-10-11
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>

Wow, I picked up a couple of _nice_ techniques from that one post!
Looks like "data.table" will let me do SQL like things in R. I have a
warped brain. I think in "result sets" and "matrix operations"

Many thanks.

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

Williams Scott
Thanks guys - amazingly prompt solutions from the R community as always.

Yes, the c-y value reverts to just the first date event - the spirit of
this is that I am trying to identify and confirm a list of diagnoses that
a patient has coded in government administrative data. Once a diagnosis is
made and confirmed, I am not interested in whether it is listed again and
again later on. I just need that date at which it first became apparent.
So in the multiple c-y case, the min date is the correct one. Some cases
will have the same diagnosis listed dozens of times, hence the very
bloated dataset.

Time to churn through the data is not a big issue, so I will have a go
with Jim¹s neat code he just sent on perhaps a few thousand rows and see
how I get on.

S



On 17/07/2014 12:09 am, "John McKown" <[hidden email]> wrote:

>On Wed, Jul 16, 2014 at 8:51 AM, jim holtman <[hidden email]> wrote:
>> I can reproduce what you requested, but there was the question about
>> what happens with the multiple 'c-y' values.
>>
>> ====================
>>
>>> require(data.table)
>>> x <- read.table(text = 'id   date value
>> + a    2000-01-01 x
>> + a    2000-03-01 x
>> + b    2000-11-11 w
>> + c    2000-11-11 y
>> + c    2000-10-01 y
>> + c    2000-09-10 y
>> + c    2000-12-12 z
>> + c    2000-10-11 z
>> + d    2000-11-11 w
>> + d    2000-11-10 w', as.is = TRUE, header = TRUE)
>>> setDT(x)
>>> x[, date := as.Date(date)]
>>> setkey(x, id, value, date)
>>>
>>> y <- x[
>> +     , {
>> +         if (.N == 1) val <- NULL  # only one -- delete
>> +         else {
>> +             dif <- difftime(tail(date, -1), head(date, -1), units =
>>'days')
>> +             # return first value if any > 31
>> +             if (any(dif >= 31)) val <- list(date = date[1L])
>> +             else val <- NULL
>> +         }
>> +         val
>> +       }
>> +     , keyby = 'id,value'
>> +     ]
>>> y
>>    id value       date
>> 1:  a     x 2000-01-01
>> 2:  c     y 2000-09-10
>> 3:  c     z 2000-10-11
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>
>Wow, I picked up a couple of _nice_ techniques from that one post!
>Looks like "data.table" will let me do SQL like things in R. I have a
>warped brain. I think in "result sets" and "matrix operations"
>
>Many thanks.
>
>--
>There is nothing more pleasant than traveling and meeting new people!
>Genghis Khan
>
>Maranatha! <><
>John McKown


This email (including any attachments or links) may contain
confidential and/or legally privileged information and is
intended only to be read or used by the addressee.  If you
are not the intended addressee, any use, distribution,
disclosure or copying of this email is strictly
prohibited.  
Confidentiality and legal privilege attached to this email
(including any attachments) are not waived or lost by
reason of its mistaken delivery to you.
If you have received this email in error, please delete it
and notify us immediately by telephone or email.  Peter
MacCallum Cancer Centre provides no guarantee that this
transmission is free of virus or that it has not been
intercepted or altered and will not be liable for any delay
in its receipt.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

arun kirshna
In reply to this post by Williams Scott
Hi,
If `dat` is the dataset

library(dplyr)
dat%>%
group_by(id,value)%>%

arrange(date=as.Date(date))%>%
filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
#Source: local data frame [3 x 3]
#Groups: id, value
#
#  id       date value
#1  a 2000-01-01     x
#2  c 2000-09-10     y
#3  c 2000-10-11     z
A.K.




On Wednesday, July 16, 2014 9:10 AM, Williams Scott <[hidden email]> wrote:
Hi R experts,

I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
in an individual (Œid¹) if they occur
more than once at least 30 days apart.


id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w


I wish to subset the data to retain rows where the value for the
individual is confirmed more than 30 days apart. So, after deleting all
rows with just one occurrence of id and value, the rest would be the
earliest occurrence of each value in each case id, provided 31 or more
days exist between the dates. If >1 value is present per id, each value
level needs to be assessed independently. This example would then reduce
to:


id   date           value
a    2000-01-01 x
c    2000-09-10 y
c    2000-10-11 z



I can do this via some crude loops and subsetting, but I am looking for as
much efficiency as possible
as the dataset has around 50 million rows to assess. Any suggestions
welcomed.

Thanks in advance

Scott Williams MD
Melbourne, Australia



This email (including any attachments or links) may contain
confidential and/or legally privileged information and is
intended only to be read or used by the addressee.  If you
are not the intended addressee, any use, distribution,
disclosure or copying of this email is strictly
prohibited.
Confidentiality and legal privilege attached to this email
(including any attachments) are not waived or lost by
reason of its mistaken delivery to you.
If you have received this email in error, please delete it
and notify us immediately by telephone or email.  Peter
MacCallum Cancer Centre provides no guarantee that this
transmission is free of virus or that it has not been
intercepted or altered and will not be liable for any delay
in its receipt.
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

William Dunlap
Using base R you can solve this by doing some sorting and comparing
the first and last dates in each id-value group.  Computing the last
and last dates can be vectorized.

f1 <- function(data) {
    # sort by id, break ties with value, break remaining ties with date
    sortedData <- data[with(data, order(id, value, date)), ]
    i <- seq_len(NROW(sortedData)-1)
    # a 'group' has same id and value, entries in group are sorted by date
    isBreakPoint <- with(sortedData, id[i]!=id[i+1] | value[i]!=value[i+1])
    isFirstInGroup <- c(TRUE, isBreakPoint)
    isLastInGroup <- c(isBreakPoint, TRUE)
    sortedData[isFirstInGroup,][sortedData[isLastInGroup,"date"] -
sortedData[isFirstInGroup,"date"] >= 31,]
}
dat <- read.table(colClasses=c("character", "Date", "character"),
header=TRUE, text=
"id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w")

> f1(dat)
  id       date value
1  a 2000-01-01     x
6  c 2000-09-10     y
8  c 2000-10-11     z

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Wed, Jul 16, 2014 at 7:49 AM, arun <[hidden email]> wrote:

> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
>
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> #  id       date value
> #1  a 2000-01-01     x
> #2  c 2000-09-10     y
> #3  c 2000-10-11     z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <[hidden email]> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
>
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

William Dunlap
In reply to this post by arun kirshna
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))

Note that the 'date == min(date)' will cause superfluous output rows
when there are several readings on initial date for a given id/value
pair.  E.g.,

> dat1 <- data.frame(stringsAsFactors=FALSE, id=rep("A", 4), value=rep("x", 4), date=as.Date("2000-10-1")+c(1,1,50,50))
> f2(dat1) # want 1 output row: A, x, 2000-10-2
Source: local data frame [2 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02
2  A     x 2000-10-02

where f2 is your code wrapped up in a function (to make testing and use easier)

f2 <- function (data)
{
    library(dplyr)
    data %>% group_by(id, value) %>% arrange(date = as.Date(date)) %>%
        filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
}


Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Wed, Jul 16, 2014 at 7:49 AM, arun <[hidden email]> wrote:

> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
>
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> #  id       date value
> #1  a 2000-01-01     x
> #2  c 2000-09-10     y
> #3  c 2000-10-11     z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <[hidden email]> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
>
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: how to subset based on other row values and multiplicity

arun kirshna
Hi Bill,

Modifying `f2` seems to solve the problem.

f2 <- function (data)
{
    library(dplyr)
    data%>%
    group_by(id, value) %>%
    mutate(date=as.Date(date))%>%
    arrange(date) %>%
    filter(indx =any(c(abs(diff(date)),NA) >31)& date==min(date)) %>%
    filter(row_number()==1)

}

 f2(dat)
Source: local data frame [3 x 3]
Groups: id, value

  id       date value
1  a 2000-01-01     x
2  c 2000-09-10     y
3  c 2000-10-11     z

f2(dat1)
Source: local data frame [1 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02



A.K.


On Wednesday, July 16, 2014 4:25 PM, William Dunlap <[hidden email]> wrote:
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))

Note that the 'date == min(date)' will cause superfluous output rows
when there are several readings on initial date for a given id/value
pair.  E.g.,

> dat1 <- data.frame(stringsAsFactors=FALSE, id=rep("A", 4), value=rep("x", 4), date=as.Date("2000-10-1")+c(1,1,50,50))
> f2(dat1) # want 1 output row: A, x, 2000-10-2
Source: local data frame [2 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02
2  A     x 2000-10-02

where f2 is your code wrapped up in a function (to make testing and use easier)

f2 <- function (data)
{
    library(dplyr)
    data %>% group_by(id, value) %>% arrange(date = as.Date(date)) %>%
        filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
}


Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Wed, Jul 16, 2014 at 7:49 AM, arun <[hidden email]> wrote:

> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
>
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> #  id       date value
> #1  a 2000-01-01     x
> #2  c 2000-09-10     y
> #3  c 2000-10-11     z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <[hidden email]> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
>
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.