[r] How to pick columns from a ragged array?

[r] How to pick columns from a ragged array?

Stuart Leask
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date).
Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. I need to identify and remove these individuals, as it's not possible to say uniquely what their first (or last) diagnosis was. They are the individuals who have pairs of rows with the same ID and the same (lowest or highest) DATE. The size of the dataset precludes doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:


- Sort by DATE then ID.
- Make a ragged array of DATE by ID.
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date).
- (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array.

I hope someone can help. My ugly code is below, with some data for testing.


Stuart


Dr Stuart John Leask DM FRCPsych MB BChir MA
Clinical Senior Lecturer and Honorary Consultant Psychiatrist
Institute of Mental Health, Innovation Park
Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44 115 82 30419 [hidden email]
Google 'Dr Stuart Leask'


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind(ID, DATE)
rag.a <- split(id.d[, 2], id.d[, 1])          # create ragged array: 1-n DATEs for every ID

# Inelegant attempt to remove IDs that only have one entry:

rag.s <- tapply(id.d[, 2], id.d[, 1], sum)    # add up the dates per ID
# Since DATE is in 'year mo da' form, if there's only one date, the sum will be less than 21000000:
rag.t <- rag.s[rag.s > 21000000]
multi.dates <- rownames(rag.t)                # all the IDs with > 1 date
rag.am <- rag.a[multi.dates]                  # rag.am only has IDs with > 1 date


# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
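
A ragged array built with split() is a list, so it has no columns as such. One possibility is to pull the n-th element out of every component with sapply(). The following is only a minimal sketch on invented data: nth is a hypothetical helper name (not from this thread), and the sketch assumes the dates are already sorted within each ID.

```r
# Toy data in the same shape as the example above (values are invented).
ID   <- c(1, 1, 1, 2, 3, 3)
DATE <- c(20040101, 20040101, 20050101, 20060101, 20070101, 20080101)

rag <- split(DATE, ID)                 # list: one vector of dates per ID

# Hypothetical helper: n-th element of each component, NA where it is too short.
nth <- function(lst, n) sapply(lst, function(v) v[n])

first  <- nth(rag, 1)
second <- nth(rag, 2)

# The "columns" of the ragged array, side by side.
out <- cbind(ID = as.numeric(names(rag)), first = first, second = second)

# IDs whose first two dates coincide; which() drops the NA comparisons.
dup.first <- names(which(first == second))
```

Reversing each component first (for example with lapply(rag, rev)) would give the last and second-to-last dates in the same way.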

This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.   Please do not use, copy or disclose the information contained in this message or in any attachment.  Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham.

This message has been checked for viruses but the contents of an attachment
may still contain software viruses which could damage your computer system:
you are advised to perform your own checks. Email communications with the
University of Nottingham may be monitored as permitted by UK legislation.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [r] How to pick columns from a ragged array?

Rui Barradas
Hello,

I'm not sure I understand the problem well. With the solution below, the only
ID returned is ID == 814, but its repeated DATE is neither the first nor the last.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
id.d[how.many > 1, ]

See the help page ?ave if the repetition of id.d[,1] is confusing.
The first argument is the vector to average (to apply FUN to) and the second is
one of the two vectors defining the groups.
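
As a small illustration of how ave() behaves with two grouping vectors, here is a sketch on invented data (the names id and date are placeholders for id.d[,1] and id.d[,2]). Each row receives the number of rows that share its (id, date) pair.

```r
# Invented data: the pair (1, 10) occurs twice, every other pair once.
id   <- c(1, 1, 1, 2, 2)
date <- c(10, 10, 20, 30, 40)

# First argument: the vector FUN is applied to.
# Remaining arguments: the grouping vectors (here id and date together).
how.many <- ave(id, id, date, FUN = length)

# Rows belonging to a repeated (id, date) pair.
dups <- cbind(id, date)[how.many > 1, , drop = FALSE]
```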

Hope this helps,

Rui Barradas

Re: [r] How to pick columns from a ragged array?

Rui Barradas
Hello,

Thinking again, if you just want, for each ID, the rows where the first (or last)
DATE is repeated, the following function does the job. Since there were no such
cases in your example data, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)


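getRepeat <- function(x, first = TRUE){
     # Assumes the rows of x are sorted by DATE within each ID.
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])                  # one data frame per ID
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)   # first (or last) DATE of each ID
     # for each ID, mark the rows that fall on that DATE
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))                      # number of such rows per ID
     sp1 <- sp[n > 1]                                   # keep IDs where that DATE repeats
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}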

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones
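
For a table of around a million rows, splitting into one data frame per ID may be slow. A possible vectorised variant, which also does not rely on the row order, is sketched below. It is only a sketch under stated assumptions: flagIDs is a hypothetical name, the data are invented, and IDs are assumed to be numeric.

```r
# Hypothetical helper: IDs whose earliest (or latest) DATE occurs on more than one row.
flagIDs <- function(id, date, first = TRUE) {
  pick    <- if (first) min else max
  edge    <- ave(date, id, FUN = pick)   # each row's group minimum (or maximum)
  at.edge <- date == edge                # rows lying on that date
  n.edge  <- tapply(at.edge, id, sum)    # number of such rows per ID
  as.numeric(names(n.edge)[n.edge > 1])
}

# Invented data: ID 1 repeats its first date, ID 2 repeats its last date.
id   <- c(1, 1, 1, 2, 2, 2, 3)
date <- c(20040101, 20040101, 20050101,
          20060101, 20060202, 20060202, 20070101)

bad.first <- flagIDs(id, date)                 # ambiguous first diagnosis
bad.last  <- flagIDs(id, date, first = FALSE)  # ambiguous last diagnosis

# Rows that survive after removing IDs with an ambiguous first appointment.
keep.first <- !(id %in% bad.first)
```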


Hope this helps,

Rui Barradas



Re: [r] How to pick columns from a ragged array?

Stuart Leask
Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!)
I have corrected this, and added a test: an ID with a duplicated DATE that is not its earliest DATE, but is the same DATE as the earliest (duplicated) DATE of another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])  

This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
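
One possible explanation, offered as a sketch rather than a diagnosis: earliest %in% nd.b tests each earliest date against every value in nd.b, without regard to which ID the duplicate belongs to. That can go wrong when one ID's unique earliest date happens to equal another ID's duplicated date. The invented data below show such a case, alongside a check that matches ID and DATE together (key is a hypothetical helper name).

```r
# Invented data: ID 1 has a unique earliest date (100) that is also
# ID 2's duplicated, non-earliest date.
ID   <- c(1, 1, 2, 2, 2)
DATE <- c(100, 200, 50, 100, 100)
id.d <- cbind(ID, DATE)

how.many <- ave(id.d[, 1], id.d[, 1], id.d[, 2], FUN = length)
nd.b     <- id.d[how.many > 1, , drop = FALSE]   # duplicated (ID, DATE) rows
earliest <- tapply(DATE, ID, min)                # earliest DATE per ID

# Date-only test: flags ID 1, although its earliest date is not duplicated.
by.date <- names(earliest)[earliest %in% nd.b]

# Pairwise test: compare "ID DATE" keys, so a match must be within the same ID.
key <- function(id, date) paste(id, date)
by.pair <- names(earliest)[key(names(earliest), earliest) %in%
                           key(nd.b[, "ID"], nd.b[, "DATE"])]
```

Here by.date contains ID 1, while by.pair is empty, because neither ID has a repeated earliest date.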


Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).


Stuart


************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...




******************************************************************




Re: [r] How to pick columns from a ragged array?

PIKAL Petr
In reply to this post by Stuart Leask
Hi

I did not check your code, and instead followed your explanation. BTW, thanks for the test data.

A small change to the data frame, to make DATE a Date class:

datum <- as.Date(as.character(DATE), format="%Y%m%d")
id.d <- data.frame(ID, datum)

Ordering by date:

id.d <- id.d[order(id.d$datum),]


Two functions to test whether the first two dates are the same or the last two dates are the same:

testfirst <- function(x) x[1,2]==x[2,2]
# nrow(x), not length(x): length() of a data frame counts its columns.
# IDs with a single row give NA, as in testfirst.
testlast <- function(x) { n <- nrow(x); if (n < 2) NA else x[n,2]==x[n-1,2] }

Change the last date in the data frame to be the same as the previous one:

id.d[40,2] <- id.d[39,2]

And here are the results:

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE

sapply(split(id.d, id.d$ID), testfirst)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select the IDs for which the test is TRUE:
which(sapply(split(id.d, id.d$ID), testlast))

and use them to subset/remove rows from your data frame (%in% also works when more than one ID is selected):
id.d$ID %in% as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE

However, I am not sure if this is exactly what you want.

Regards
Petr


Re: [r] How to pick colums from a ragged array?

PIKAL Petr
Hi

Rui's answer led me to a more elaborate solution, which still needs the data frame to be ordered by date:

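# data: data frame with ID in column 1 and date in column 2, ordered by date
fff<-function(data, first=TRUE, remove=FALSE) {

# TRUE when the first two rows of one ID share the same date
testfirst <- function(x) nrow(x) > 1 && x[1,2]==x[2,2]
# TRUE when the last two rows share the same date
# (nrow(), because length() of a data frame counts columns, not rows)
testlast <- function(x) nrow(x) > 1 && x[nrow(x),2]==x[nrow(x)-1,2]

test <- if(first) testfirst else testlast
sel <- as.numeric(names(which(sapply(split(data, data[,1]), test))))

# %in% instead of == so that more than one selected ID is handled
if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}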


> fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

> fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
>

Regards
Petr



Re: [r] How to pick colums from a ragged array?

Stuart Leask
In reply to this post by Rui Barradas
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but its first date is the same as another ID's first date that IS a duplicate.
The test data below covers this case - see it failing.

So, I remain very grateful for your function!

Stuart


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
 # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
 # Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.


-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: [hidden email]
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!). I have corrected this, and added a test - an ID that has a duplicate which is not its earliest DATE, but which is the same DATE as an earliest duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])  

This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
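
A small sketch of what steps 2 and 3 actually compute, using made-up toy values (the names `id`, `date`, `m`, `dups` and `flagged` are illustrative, not from the thread). `ave()` with two grouping vectors counts rows per (ID, DATE) pair, but `%in%` then compares each earliest date against every cell of the duplicates matrix, so it cannot tell which ID a duplicated date belongs to:

```r
# Toy data (hypothetical): ID 1 has a duplicated earliest date;
# ID 2 has a single earliest date equal to ID 1's duplicate,
# plus a later duplicate of its own.
id   <- c(1, 1, 1, 2, 2, 2)
date <- c(20040205, 20040205, 20050101, 20040205, 20060101, 20060101)
m <- cbind(ID = id, DATE = date)

# ave() applies length() within each (ID, DATE) group,
# giving the number of rows for that pair
how.many <- ave(m[, 1], m[, 1], m[, 2], FUN = length)
dups <- m[how.many > 1, ]

earliest <- tapply(date, id, min)

# %in% treats 'dups' as a plain vector of values (IDs and dates mixed),
# so ID 2 is flagged as well, although its earliest date occurs only once
flagged <- names(earliest)[earliest %in% dups]
print(flagged)  # "1" "2"
```

Under these assumptions the check only passes when no other ID happens to have the same date among its duplicates, which may be why it looked right on the first test data.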


Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).


Stuart


************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...




******************************************************************



-----Original Message-----
From: Rui Barradas [mailto:[hidden email]]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: [hidden email]
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)


getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones


Hope this helps,

Rui Barradas


On 23-10-2012 10:59, Rui Barradas wrote:

> Hello,
>
> I'm not sure I understand it well, in the solution below the only
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing.
> The first is the vector to average (to apply FUN to) and the second is
> one of the two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas

Re: [r] How to pick colums from a ragged array?

Stuart Leask
In reply to this post by PIKAL Petr
Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"
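
One possible way to arrive at that output, sketched under the assumption that "duplicate at the earliest date" means more than one row on an ID's minimum DATE (the names `n.at.first`, `n.at.last`, `first.dups` and `last.dups` are illustrative, not from the thread). Because each ID's dates are compared only with that ID's own minimum or maximum, no sorting is needed and dates shared with other IDs do not interfere:

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

# For every ID, count the rows that fall on its own earliest / latest DATE
n.at.first <- tapply(DATE, ID, function(d) sum(d == min(d)))
n.at.last  <- tapply(DATE, ID, function(d) sum(d == max(d)))

# IDs with more than one row on that date
first.dups <- names(n.at.first)[n.at.first > 1]
last.dups  <- names(n.at.last)[n.at.last > 1]

print(first.dups)  # "167" "841" "1019"

# Rows of the data with those IDs removed
id.d <- cbind(ID, DATE)
clean.first <- id.d[!(ID %in% as.numeric(first.dups)), ]
```

On a million rows a single `tapply()` pass like this should be manageable, though that has not been timed here.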

Stuart


FW: [r] How to pick colums from a ragged array?

Stuart Leask
In reply to this post by Rui Barradas
Sorry, I must be a bit thick!
getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:

> g.r<-getRepeat(id.d)
> dim(g.r)
NULL
> summary(g.r)
     Length Class      Mode
[1,] 2      data.frame list
[2,] 2      data.frame list
[3,] 2      data.frame list

This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?

I need to end up with a list of all the IDs that have duplicate first (or last) DATES.

Stuart

-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: '[hidden email]'
Subject: RE: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't its first date, but whose first date is the same as another ID's first date that IS a duplicate.
The test data and code are below - see it failing.

So, I remain very grateful for your function!

Stuart


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
 # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
 # Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.
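
The false positive for 910 comes from `earliest %in% nd.b`: it asks whether an ID's earliest date appears anywhere in the matrix of duplicated rows, for any ID. A minimal sketch of a per-ID check on the same test data follows. The names first.date, n.first and dup.first.ids are illustrative and not from the thread, and this is one possible approach rather than the method anyone in the thread settled on.

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080711,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20080227,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20091120,20091210,
          20091224,20050503,19870508,19870508,19880330)

id.d <- cbind(ID, DATE)

# The approach from the message above: matches against ANY value in nd.b
how.many <- ave(ID, ID, DATE, FUN = length)
nd.b     <- id.d[how.many > 1, ]
earliest <- tapply(DATE, ID, min)
wrong    <- names(earliest)[earliest %in% nd.b]   # includes 910 by mistake

# Per-ID check: for every row, look up that ID's own earliest date,
# then count how many rows of the ID fall on it.
first.date    <- ave(DATE, ID, FUN = min)
n.first       <- tapply(DATE == first.date, ID, sum)
dup.first.ids <- names(n.first)[n.first > 1]

print(wrong)
print(dup.first.ids)
```

Replacing min with max in the call to ave gives the corresponding check for the last appointment. Neither version needs the data to be sorted.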


-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: [hidden email]
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!). I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE as an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])  

This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!


Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).


Stuart


************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...




******************************************************************



-----Original Message-----
From: Rui Barradas [mailto:[hidden email]]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: [hidden email]
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)


getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones


Hope this helps,

Rui Barradas


On 23-10-2012 10:59, Rui Barradas wrote:

> Hello,
>
> I'm not sure I understand it well, in the solution below the only
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing.
> The first is the vector to average (to apply FUN to) and the second is
> one of the two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas

Re: FW: [r] How to pick colums from a ragged array?

Rui Barradas
Hello,

You're right, getRepeat returns a list of data.frames, one per each ID.
To put them all in the same df use

do.call(rbind, g.r)

Rui Barradas
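
To get from that combined data frame to the plain list of IDs Stuart asked for, something like the following sketch may help. The list g.r here is a hand-built stand-in for the value returned by getRepeat() (three two-row data frames, matching the summary output shown), not the actual result from the thread, and the names all.rows, dup.ids and dup.ids2 are illustrative.

```r
# Stand-in for the result of getRepeat(id.d): a list of data frames,
# one per ID whose first DATE is repeated (assumed values).
g.r <- list(
  data.frame(ID = c(167, 167),   DATE = c(20040205, 20040205)),
  data.frame(ID = c(841, 841),   DATE = c(20050421, 20050421)),
  data.frame(ID = c(1019, 1019), DATE = c(19870508, 19870508))
)

# Stack the list elements into a single data frame
all.rows <- do.call(rbind, g.r)

# One entry per affected ID
dup.ids <- unique(all.rows$ID)

# Same result without building the combined data frame:
# pull the ID column out of every list element
dup.ids2 <- unique(unlist(lapply(g.r, function(d) d$ID)))

print(dup.ids)
```

The lapply form is also the general answer to "how do I pick one column out of every element of a list": apply an extractor function to each element, then unlist.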


Re: [r] How to pick colums from a ragged array?

PIKAL Petr
In reply to this post by Stuart Leask
Hi

> -----Original Message-----
> From: Stuart Leask [mailto:[hidden email]]
> Sent: Tuesday, October 23, 2012 2:29 PM
> To: PIKAL Petr; [hidden email]
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi there.
>
> Not sure I follow what you are doing.
>
> I want a list of all the IDs that have duplicate DATE entries, only
> when the DATE is the earliest (or last) date for that ID.

And that is what the function (with 3 small modifications) does


fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]

if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst))))) else
sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))

if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}

See the result on your refined data

fff(id.d)
     ID       DATE
5   167 2004-02-05
6   167 2004-02-05
22  841 2005-04-21
23  841 2005-04-21
24  841 2006-04-28
25  841 2006-06-02
26  841 2006-08-16
27  841 2006-10-25
28  841 2006-11-29
29  841 2007-01-12
30  841 2007-05-14
38 1019 1987-05-08
39 1019 1987-05-08
40 1019 1988-03-30
> fff(id.d, first=F)
   ID       DATE
5 167 2004-02-05
6 167 2004-02-05
> fff(id.d, remove=T)
    ID       DATE
1   58 2006-08-21
2   58 2006-12-07
3   58 2008-01-02
4   58 2009-09-04
7  323 2005-11-11
8  323 2006-01-11
9  323 2007-11-19
10 323 2008-01-07
11 323 2008-04-07
12 323 2008-05-21
13 323 2008-07-11
14 547 2004-10-05
15 794 2007-09-05
16 814 2002-08-14
17 814 2002-11-25
18 814 2004-04-29
19 814 2004-04-29
20 814 2007-12-05
21 814 2008-02-27
31 910 1987-05-08
32 910 2004-02-05
33 910 2004-02-05
34 910 2009-11-20
35 910 2009-12-10
36 910 2009-12-24
37 999 2005-05-03
>

You can take the fff function apart to see what each piece of it returns, e.g.

sapply(split(id.d, id.d[,1]), testlast)

Regards
Petr
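
The fff function above relies on the rows being ordered by date within each ID. As a hedged alternative for the removal step, the sketch below compares every row with its own ID's earliest (or latest) date, so no sorting is needed. drop_ambiguous is a hypothetical name, and the small data frame is made up for illustration; it assumes columns called ID and DATE.

```r
# Keep only the IDs whose first (or last) appointment occupies a single row.
drop_ambiguous <- function(data, first = TRUE) {
  pick <- if (first) min else max
  d <- as.numeric(data$DATE)               # works for numeric or Date columns
  edge <- ave(d, data$ID, FUN = pick)      # each row's own earliest/latest date
  n.edge <- ave(as.numeric(d == edge), data$ID, FUN = sum)  # rows on that date, per ID
  data[n.edge == 1, , drop = FALSE]
}

# Made-up, deliberately unsorted example:
# ID 1 has two rows on its earliest date, IDs 2 and 3 do not.
d <- data.frame(ID   = c(1, 1, 1, 2, 2, 3),
                DATE = c(20100102, 20100101, 20100101,
                         20100105, 20100101, 20100101))

kept.first <- drop_ambiguous(d)                  # drops ID 1
kept.last  <- drop_ambiguous(d, first = FALSE)   # every last date is unique here

print(kept.first)
```

Because ave returns one value per row, the whole operation is a couple of vectorised passes, which may matter with around a million rows.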

>
> I have refined my test dataset, to include some tests (e.g. 910 has the
> same dup as 1019, but for 910 it's not the earliest date):
>
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
>  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>  ,20091224,20050503,19870508,19870508,19880330)
>
> Correct output:
> "167"  "841"  "1019"
>
> Stuart
>
> -----Original Message-----
> From: PIKAL Petr [mailto:[hidden email]]
> Sent: 23 October 2012 13:15
> To: Stuart Leask; [hidden email]
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> Rui's answer led me to a more elaborate solution, which still needs
> the data frame to be ordered by date
>
> fff<-function(data, first=TRUE, remove=FALSE) {
>
> testfirst <- function(x) x[1,2]==x[2,2]
> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>
> if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]),
> testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
> data[,1]), testlast))))
>
> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
>
>
> > fff(id.d)
>     ID     DATE
> 31 910 20091105
> 32 910 20091105
> 33 910 20091117
> 34 910 20091119
> 35 910 20091120
> 36 910 20091210
> 37 910 20091224
> 38 910 20091224
>
> > fff(id.d, remove=T)
>      ID     DATE
> 1    58 20060821
> 2    58 20061207
> 3    58 20080102
> 4    58 20090904
> 5   167 20040205
> 6   167 20040323
> 7   323 20051111
> 8   323 20060111
> 9   323 20071119
> 10  323 20080107
> 11  323 20080407
> 12  323 20080521
> 13  323 20080711
> 14  547 20041005
> 15  794 20070905
> 16  814 20020814
> 17  814 20021125
> 18  814 20040429
> 19  814 20040429
> 20  814 20071205
> 21  814 20080227
> 22  841 20050421
> 23  841 20060130
> 24  841 20060428
> 25  841 20060602
> 26  841 20060816
> 27  841 20061025
> 28  841 20061129
> 29  841 20070112
> 30  841 20070514
> 39  999 20050503
> 40 1019 19870508
> 41 1019 19880223
> 42 1019 19880330
> 43 1019 19880330
> >
>
> Regards
> Petr
>
>
> > -----Original Message-----
> > From: [hidden email] [mailto:r-help-bounces@r-
> > project.org] On Behalf Of PIKAL Petr
> > Sent: Tuesday, October 23, 2012 1:49 PM
> > To: Stuart Leask; [hidden email]
> > Subject: Re: [R] [r] How to pick colums from a ragged array?
> >
> > Hi
> >
> > I did not check your code and rather followed your explanation. BTW,
> > thanks for test data.
> >
> > small change in data frame to make DATE as Date class
> >
> > datum<-as.Date(as.character(DATE), format="%Y%m%d")
> > id.d <- data.frame(ID,datum )
> >
> > ordering by date
> >
> > id.d<-id.d[order(id.d$datum),]
> >
> >
> > two functions to test if first two dates are the same or last two
> > dates are the same
> >
> > testfirst <- function(x) x[1,2]==x[2,2]
> > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> >
> > change one last date in the data frame to be the same as previous
> >
> > id.d[35,2]<-id.d[36,2]
> >
> > and here are results
> >
> > sapply(split(id.d, id.d$ID), testlast)
> >    58   167   323   547   794   814   841   910   999  1019
> > FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
> >
> > > sapply(split(id.d, id.d$ID), testfirst)
> >    58   167   323   547   794   814   841   910   999  1019
> > FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
> >
> > Now you can select ID which is true and remove it from your data
> > which(sapply(split(id.d, id.d$ID), testlast))
> >
> > and use it for your data frame to subset/remove
> > id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
> >  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> > [37]  TRUE  TRUE  TRUE  TRUE
> >
> > However I am not sure if this is exactly what you want.
> >
> > Regards
> > Petr

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by Stuart Leask
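Hi,
I was not following the thread.
Maybe this is what you are looking for:

# keep every row whose (ID, DATE) pair occurs more than once
new1 <- id.d[duplicated(id.d) | duplicated(id.d, fromLast = TRUE), ]

# first ID recorded for each duplicated DATE
tapply(new1$ID, new1$DATE, head, 1)
# 19870508 20040205 20040429 20050421
#     1019      167      814      841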
A.K.
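
To see what the `duplicated(x) | duplicated(x, fromLast = TRUE)` idiom does, here is a minimal sketch on a toy vector (the vector `x` is made up for illustration). `duplicated()` on its own marks only the second and later copies of a value; OR-ing it with the scan from the other end marks every member of a repeated group, which is why all rows of a duplicated (ID, DATE) pair are kept.

```r
x <- c(1, 2, 2, 3, 2)

# Only the later copies of a repeated value are flagged
later.copies <- duplicated(x)

# Scanning from both ends flags every copy, including the first one
all.copies <- duplicated(x) | duplicated(x, fromLast = TRUE)
```

The same call works row-wise on a data frame or matrix, where a "value" is a whole row.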




----- Original Message -----
From: Stuart Leask <[hidden email]>
To: PIKAL Petr <[hidden email]>; "[hidden email]" <[hidden email]>
Cc:
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset to include some test cases (e.g. 910 has a duplicated DATE that is not its earliest, while its earliest DATE is the same as 1019's duplicated earliest DATE):


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart
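
One way the stated requirement could be checked against this test data is sketched below. It is not taken from any reply in the thread: the helper name `dupAtExtreme` is hypothetical, and the approach (count, per ID, how many rows sit on that ID's minimum or maximum DATE) is just one possible reading of "duplicate DATE entries, only when the DATE is the earliest (or last)". It does not need the data to be sorted.

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080711,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20080227,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20091120,20091210,
          20091224,20050503,19870508,19870508,19880330)

id.d <- data.frame(ID, DATE)

# Hypothetical helper: IDs whose earliest (or latest) DATE occurs more than once
dupAtExtreme <- function(d, first = TRUE) {
  pick <- if (first) min else max
  extreme <- ave(d$DATE, d$ID, FUN = pick)        # each row gets its ID's min/max DATE
  n.hits <- tapply(d$DATE == extreme, d$ID, sum)  # rows per ID that sit on that DATE
  names(n.hits)[n.hits > 1]
}

dup.first <- dupAtExtreme(id.d)                 # "167" "841" "1019"
dup.last  <- dupAtExtreme(id.d, first = FALSE)  # "167"
```

ID 167 appears in both results because its only two rows share one DATE, which is both its earliest and its latest.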

-----Original Message-----
From: PIKAL Petr [mailto:[hidden email]]
Sent: 23 October 2012 13:15
To: Stuart Leask; [hidden email]
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer led me to a more elaborate solution, which still needs the data frame to be ordered by date:

fff <- function(data, first = TRUE, remove = FALSE) {
  # TRUE when the first (or last) two rows of an ID share a date, NA for a single row
  testfirst <- function(x) x[1, 2] == x[2, 2]
  # nrow(x), not length(x): length() of a data frame counts its columns
  testlast <- function(x) if (nrow(x) < 2) NA else x[nrow(x), 2] == x[nrow(x) - 1, 2]

  test <- if (first) testfirst else testlast
  sel <- as.numeric(names(which(sapply(split(data, data[, 1]), test))))

  # %in% rather than ==, so that several selected IDs are handled correctly
  if (remove) data[!(data[, 1] %in% sel), ] else data[data[, 1] %in% sel, ]
}


> fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

> fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
>

Regards
Petr



Re: FW: [r] How to pick colums from a ragged array?

Stuart Leask
In reply to this post by Rui Barradas
So I get my list of IDs to exclude from:

g.rr <- do.call(rbind, g.r)   # stack the per-ID data frames into one
unique(g.rr[, 1])             # each ID to exclude, listed once

Many thanks.

Stuart

-----Original Message-----
From: Rui Barradas [mailto:[hidden email]]
Sent: 23 October 2012 13:42
To: Stuart Leask
Cc: [hidden email]
Subject: Re: FW: [R] [r] How to pick colums from a ragged array?

Hello,

You're right, getRepeat returns a list of data.frames, one per ID.
To put them all in the same data frame, use

do.call(rbind, g.r)
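
As a small illustration of that call, here is a sketch using a made-up two-element list in place of the real `getRepeat()` result (the contents of `g.r` below are an assumption, chosen only to have the same shape: one data frame per ID). It also shows how a single column can be pulled out of every list element directly, which is the "pick a column from a ragged list" step.

```r
# Hypothetical stand-in for the list that getRepeat() returns
g.r <- list(
  data.frame(ID = c(167, 167), DATE = c(20040205, 20040205)),
  data.frame(ID = c(841, 841), DATE = c(20050421, 20050421))
)

# Stack all the per-ID data frames into one
stacked <- do.call(rbind, g.r)

# Or take the first ID value from each list element, one per data frame
ids <- vapply(g.r, function(d) d$ID[1], numeric(1))
```

`vapply()` is used rather than `sapply()` so that the result is guaranteed to be a plain numeric vector.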

Rui Barradas
On 23-10-2012 13:36, Stuart Leask wrote:

> Sorry, I must be a bit thick!
> getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
>
>> g.r<-getRepeat(id.d)
>> dim(g.r)
> NULL
>> summary(g.r)
>       Length Class      Mode
> [1,] 2      data.frame list
> [2,] 2      data.frame list
> [3,] 2      data.frame list
>
> This leaves me with the same problem I had with my ragged array, i.e. how do I put all the second elements from this long list of data frames into a single list?
>
> I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
>
> Stuart
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 23 October 2012 13:17
> To: 'Rui Barradas'
> Cc: '[hidden email]'
> Subject: RE: [R] [r] How to pick colums from a ragged array?
>
> Ah, no, my method does fail.
> Consider an ID that has a duplicate DATE that isn't the first date, but its first date is the same as another ID's first date that IS a duplicate.
> The test data below shows it failing.
>
> So, I remain very grateful for your function!
>
> Stuart
>
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
>   c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>   ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>   ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>   ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>   ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>   ,20091224,20050503,19870508,19870508,19880330)
>
>   id.d <- cbind (ID,DATE )
> # rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
>   # Inelegant attempt to remove IDs that only have one entry:
> # rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
>   # Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
> # rag.t <- rag.s [ rag.s > 21000000 ]
> # multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
> # rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> nd.b<- id.d[how.many > 1, ]
>
> #ni<-dim(nd.b)[1]
> #nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
> #nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates
>
> earliest<-tapply(DATE,ID,min)  # table of mins
> rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
> # This suggests ID 910 has a duplicate earliest date, and it doesn't: it has a non-earliest duplicate,
> # and an earliest date that is the same as another ID's earliest+duplicate.
>
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 23 October 2012 12:38
> To: 'Rui Barradas'
> Cc: [hidden email]
> Subject: RE: [R] [r] How to pick colums from a ragged array?
>
> Thanks Rui - your initial, very elegant suggestion, has spurred me on!
>
> 1. As you noticed, my example data had no examples of duplicate first dates (DOH!). I have corrected this, and added a test: an ID that has a duplicate which is not the earliest DATE, but is the same DATE as an earliest/duplicate for another ID.
>
> 2. Your suggestion gave me all the duplicates:
>
> how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
> nd.b<- id.d [ how.many  > 1,  ]
>
> 3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
>
> earliest <- tapply ( DATE, ID, min)
> rownames(earliest[earliest%in%nd.b])
>
> This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
> I'm trying to work out why!
>
>
> Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
>
>
> Stuart
>
>
> ************************************************
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
>   c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>   ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>   ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>   ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>   ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
>   ,20091224,20050503,19870508,19870508,19880330)
>
>   id.d <- cbind (ID,DATE )
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> nd.b<- id.d[how.many > 1, ]
>
> earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
> rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...
>
>
>
>
> ******************************************************************
>
>
>
> -----Original Message-----
> From: Rui Barradas [mailto:[hidden email]]
> Sent: 23 October 2012 12:21
> To: Stuart Leask
> Cc: [hidden email]
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hello,
>
> Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
> ,1019,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
> ,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
>
> id.d <- cbind(ID, DATE)
>
>
> getRepeat <- function(x, first = TRUE){
>       fun <- if(first) head else tail
>       sp <- split(data.frame(x), x[,1])
>       first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
>       lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
>       n <- unlist(lapply(lst, sum))
>       sp1 <- sp[n > 1]
>       i1 <- lst[n > 1]
>       lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
> }
>
> getRepeat(id.d)                 # defaults to first = TRUE
> getRepeat(id.d, first = FALSE)  # to get the last ones
>
>
> Hope this helps,
>
> Rui Barradas
>
>
> On 23-10-2012 10:59, Rui Barradas wrote:
>> Hello,
>>
>> I'm not sure I understand it well, in the solution below the only
>> returned value is ID == 814 but it's not the first nor the last DATE.
>>
>> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
>> id.d[how.many > 1, ]
>>
>> See the help page ?ave if the repetition of id.d[,1] is confusing.
>> The first is the vector to average (to apply FUN to) and the second
>> is one of the two vectors defining the groups.
>>
>> Hope this helps,
>>
>> Rui Barradas

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by arun kirshna
Hi,
tapply(new1[,1],new1[,2],head,1) # works.

I used:
id.d<-data.frame(ID,DATE)
#In that case,
tapply(new1$ID,new1$DATE,head,1)
#works
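
The difference between the two calls comes down to whether `id.d` is a matrix or a data frame. A minimal sketch on a three-row toy example (the short `ID` and `DATE` vectors here are made up for illustration): `cbind()` of two numeric vectors gives a numeric matrix, on which `$` is not defined, while `data.frame()` gives an object whose columns can be reached with `$`.

```r
ID <- c(167, 167, 841)
DATE <- c(20040205, 20040205, 20050421)

m <- cbind(ID, DATE)        # numeric matrix with column names "ID" and "DATE"
d <- data.frame(ID, DATE)   # data frame with the same columns

by.name.matrix <- m[, "DATE"]   # matrices are indexed with [row, column]
by.name.frame <- d$DATE         # data frames also allow $

# `$` on a matrix raises an error; capture it instead of stopping
res <- tryCatch(m$DATE, error = function(e) e)
```

Positional indexing such as `x[, 2]` works on both kinds of object, which is why the `new1[,1]`, `new1[,2]` form runs either way.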

On closer look, I think you don't want the "814" ID.  Not sure about the logic behind that.

A.K.





----- Original Message -----
From: Stuart Leask <[hidden email]>
To: arun <[hidden email]>
Cc: Petr PIKAL <[hidden email]>
Sent: Tuesday, October 23, 2012 9:15 AM
Subject: RE: [R] [r] How to pick colums from a ragged array?

Sorry Arun, but when I run it I get an error:

> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
+ ,547,794,814,814,814,814,814,814,841,841,841,841,841
+ ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
+ ,1019)
>
> DATE <-
+  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
+  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
+  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
+  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
+  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
+  ,20091224,20050503,19870508,19870508,19880330)
>
>  id.d <- cbind (ID,DATE )
> new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
>
>
> tapply(new1$ID,new1$DATE,head,1)
Error in new1$DATE : $ operator is invalid for atomic vectors




-----Original Message-----
From: arun [mailto:[hidden email]]
Sent: 23 October 2012 14:05
To: Stuart Leask
Cc: R help; Petr PIKAL
Subject: Re: [R] [r] How to pick colums from a ragged array?

HI,
I was not following the thread.
May be this is what you are looking for:
new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]


tapply(new1$ID,new1$DATE,head,1)
#19870508 20040205 20040429 20050421
  #  1019      167      814      841
A.K.




----- Original Message -----
From: Stuart Leask <[hidden email]>
To: PIKAL Petr <[hidden email]>; "[hidden email]" <[hidden email]>
Cc:
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart
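
One possible way to get the stated output, sketched here as an illustration rather than as anyone's posted solution: count, for each ID, how many rows fall on that ID's earliest (or latest) date. The names n.first, n.last, dup.first and dup.last are made up for this sketch; no sorting is needed.

```r
# Refined test data from the message above (ID written with rep() for brevity)
ID <- rep(c(58, 167, 323, 547, 794, 814, 841, 910, 999, 1019),
          times = c(4, 2, 7, 1, 1, 6, 9, 6, 1, 3))
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080711,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20080227,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20091120,20091210,
          20091224,20050503,19870508,19870508,19880330)
id.d <- data.frame(ID, DATE)

# Number of rows per ID that fall on that ID's earliest / latest date
n.first <- tapply(id.d$DATE, id.d$ID, function(d) sum(d == min(d)))
n.last  <- tapply(id.d$DATE, id.d$ID, function(d) sum(d == max(d)))

dup.first <- names(n.first)[n.first > 1]   # "167" "841" "1019"
dup.last  <- names(n.last)[n.last > 1]     # "167"
dup.first
dup.last
```

Because the comparison is made within each ID, 814 and 910 are not picked up: their duplicated dates are neither their earliest nor their latest.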

-----Original Message-----
From: PIKAL Petr [mailto:[hidden email]]
Sent: 23 October 2012 13:15
To: Stuart Leask; [hidden email]
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer led me to a more elaborate solution, which still needs the data frame to be ordered by date:

fff<-function(data, first=TRUE, remove=FALSE) {

# TRUE when the first (or last) two rows of one ID's block share a date
testfirst <- function(x) x[1,2]==x[2,2]
# nrow(), not length(): length() of a data frame counts columns, not rows
testlast <- function(x) nrow(x) > 1 && x[nrow(x),2]==x[nrow(x)-1,2]

sel <- as.numeric(names(which(sapply(split(data, data[,1]), if (first) testfirst else testlast))))

# %in% rather than ==, so that more than one selected ID is handled
if (remove) data[!(data[,1] %in% sel),] else data[data[,1] %in% sel,] }


> fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

> fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
>

Regards
Petr
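
The original question, how to take the dates out of the ragged array "by column", can also be answered directly. A minimal sketch on a small made-up subset of the data, assuming the ragged array is the list returned by split(); the names o, first.date, second.date and dup.ids are only for illustration:

```r
ID   <- c(58, 58, 58, 167, 167, 547)
DATE <- c(20060821, 20061207, 20080102, 20040205, 20040205, 20041005)

# Sort by ID, then DATE, so position 1 in each element is the earliest date
o     <- order(ID, DATE)
rag.a <- split(DATE[o], ID[o])

# "[" with an index picks one position from every list element;
# IDs with a single date give NA for the second position
first.date  <- sapply(rag.a, "[", 1)
second.date <- sapply(rag.a, "[", 2)

cbind(ID = as.numeric(names(rag.a)), first.date, second.date)

# IDs whose first two dates coincide; which() drops the NA cases
dup.ids <- names(rag.a)[which(first.date == second.date)]
dup.ids
```

This also removes the need for the separate "IDs that occur only once" step, because a missing second date shows up as NA and is ignored by which().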


> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-
> project.org] On Behalf Of PIKAL Petr
> Sent: Tuesday, October 23, 2012 1:49 PM
> To: Stuart Leask; [hidden email]
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hi
>
> I did not check your code and rather followed your explanation. BTW,
> thanks for test data.
>
> small change in data frame to make DATE as Date class
>
> datum<-as.Date(as.character(DATE), format="%Y%m%d")
> id.d <- data.frame(ID,datum )
>
> ordering by date
>
> id.d<-id.d[order(id.d$datum),]
>
>
> two functions to test if first two dates are the same or last two
> dates are the same
>
> testfirst <- function(x) x[1,2]==x[2,2]
> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>
> change one last date in the data frame to be the same as previous
>
> id.d[35,2]<-id.d[36,2]
>
> and here are results
>
> sapply(split(id.d, id.d$ID), testlast)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
>
> > sapply(split(id.d, id.d$ID), testfirst)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
>
> Now you can select ID which is true and remove it from your data
> which(sapply(split(id.d, id.d$ID), testlast))
>
> and use it for your data frame to subset/remove
> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> [37]  TRUE  TRUE  TRUE  TRUE
>
> However I am not sure if this is exactly what you want.
>
> Regards
> Petr
>

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by arun kirshna
Hi,
Also one more thing:
This should get the dates which are duplicated.  In my first reply, I was looking for the duplicated rows. Sorry for that!

id.d<-data.frame(ID,DATE)

new1<-id.d[duplicated(id.d$DATE)|duplicated(id.d$DATE,fromLast=TRUE),]


new2<-new1[order(new1$ID,new1$DATE),]
 tapply(new2$ID,new2$DATE,head,1)
#19870508 20040205 20040429 20050421
  #   910      167      814      841

But the result is still not what you wanted: 910 is returned for 19870508 because, after ordering by ID, its row comes before those of 1019.
new1[order(new1$ID,new1$DATE),]
#     ID     DATE
#5   167 20040205
#6   167 20040205
#18  814 20040429
#19  814 20040429
#22  841 20050421
#23  841 20050421
#31  910 19870508
#32  910 20040205
#33  910 20040205
#38 1019 19870508
#39 1019 19870508

A.K.


Re: [r] How to pick colums from a ragged array?

Stuart Leask
I too had a parsimonious solution that was also fooled by IDs with a duplicate date that wasn't their first date but was the same as another ID's duplicated first date.

The right answer from this data:

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

is:

167, 841 and 1019 - correct.
814 and 910 - incorrect. Although they have duplicate dates, these are not their first dates.
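
Once those IDs are known, the grouping by first diagnosis could be sketched roughly as follows. This is only an illustration: the DIAGNOSIS codes are invented, and the names dat, first.rows, ambiguous and first.dx do not come from the thread.

```r
# Toy data; the DIAGNOSIS codes are invented for illustration
dat <- data.frame(
  ID        = c(58, 58, 167, 167, 910, 910, 910),
  DATE      = c(20060821, 20061207, 20040205, 20040205,
                19870508, 20040205, 20040205),
  DIAGNOSIS = c("A", "B", "A", "C", "B", "A", "C"),
  stringsAsFactors = FALSE)

# Keep only the rows on each patient's first date
first.rows <- dat[dat$DATE == ave(dat$DATE, dat$ID, FUN = min), ]

# Patients with more than one row on that date have an ambiguous first diagnosis
n.rows    <- table(first.rows$ID)
ambiguous <- as.numeric(names(n.rows)[n.rows > 1])

# One row per remaining patient: ID and first diagnosis
first.dx <- first.rows[!(first.rows$ID %in% ambiguous), c("ID", "DIAGNOSIS")]
first.dx
```

In this toy data 167 is dropped as ambiguous, while 910 is kept because its duplicated date is not its first. Swapping min for max gives the grouping by last diagnosis.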

-----Original Message-----
From: arun [mailto:[hidden email]]
Sent: 23 October 2012 14:29
To: Stuart Leask
Cc: R help
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi,
Also one more thing:
This should get the dates which are duplicated.  In my first reply, I was looking for the duplicated rows. Sorry for that!

id.d<-data.frame(ID,DATE)

new1<-id.d[duplicated(id.d$DATE)|duplicated(id.d$DATE,fromLast=TRUE),]


new2<-new1[order(new1$ID,new1$DATE),]
 tapply(new2$ID,new2$DATE,head,1)
#19870508 20040205 20040429 20050421
  #   910      167      814      841

But, still the result is not that you wanted, because 910's date is the earliest date when compared to 1019.
new1[order(new1$ID,new1$DATE),]
#     ID     DATE
#5   167 20040205
#6   167 20040205
#18  814 20040429
#19  814 20040429
#22  841 20050421
#23  841 20050421
#31  910 19870508
#32  910 20040205
#33  910 20040205
#38 1019 19870508
#39 1019 19870508

A.K.

----- Original Message -----
From: Stuart Leask <[hidden email]>
To: arun <[hidden email]>
Cc: Petr PIKAL <[hidden email]>
Sent: Tuesday, October 23, 2012 9:15 AM
Subject: RE: [R] [r] How to pick colums from a ragged array?

Sorry Arun, but when I run it I get an error:

> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
+ ,547,794,814,814,814,814,814,814,841,841,841,841,841
+ ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
+ ,1019)
>
> DATE <-
+  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
+  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
+  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
+  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
+  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
+  ,20091224,20050503,19870508,19870508,19880330)
>
>  id.d <- cbind (ID,DATE )
> new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
>
>
> tapply(new1$ID,new1$DATE,head,1)
Error in new1$DATE : $ operator is invalid for atomic vectors




-----Original Message-----
From: arun [mailto:[hidden email]]
Sent: 23 October 2012 14:05
To: Stuart Leask
Cc: R help; Petr PIKAL
Subject: Re: [R] [r] How to pick colums from a ragged array?

HI,
I was not following the thread.
May be this is what you are looking for:
new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]


tapply(new1$ID,new1$DATE,head,1)
#19870508 20040205 20040429 20050421
  #  1019      167      814      841
A.K.




----- Original Message -----
From: Stuart Leask <[hidden email]>
To: PIKAL Petr <[hidden email]>; "[hidden email]" <[hidden email]>
Cc:
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:[hidden email]]
Sent: 23 October 2012 13:15
To: Stuart Leask; [hidden email]
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }


> fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

> fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330
>

Regards
Petr


> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-
> project.org] On Behalf Of PIKAL Petr
> Sent: Tuesday, October 23, 2012 1:49 PM
> To: Stuart Leask; [hidden email]
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hi
>
> I did not check your code and rather followed your explanation. BTW,
> thanks for the test data.
>
> A small change to the data frame makes DATE a Date class:
>
> datum <- as.Date(as.character(DATE), format="%Y%m%d")
> id.d <- data.frame(ID, datum)
>
> Ordering by date:
>
> id.d <- id.d[order(id.d$datum),]
>
> Two functions to test whether the first two dates are the same, or the
> last two dates are the same:
>
> testfirst <- function(x) x[1,2]==x[2,2]
> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>
> Change one last date in the data frame to be the same as the previous one:
>
> id.d[35,2] <- id.d[36,2]
>
> and here are the results:
>
> sapply(split(id.d, id.d$ID), testlast)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
>
> sapply(split(id.d, id.d$ID), testfirst)
>    58   167   323   547   794   814   841   910   999  1019
> FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
>
> Now you can select the IDs which are TRUE and remove them from your data:
>
> which(sapply(split(id.d, id.d$ID), testlast))
>
> and use the result on your data frame to subset/remove:
>
> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> [37]  TRUE  TRUE  TRUE  TRUE
>
> However, I am not sure if this is exactly what you want.
>
> Regards
> Petr
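
One caveat about testlast as quoted: length() applied to a data frame returns the number of columns, not the number of rows, so with a two-column data frame it compares rows 2 and 1, just as testfirst does. A minimal sketch of a row-based variant follows. The names testfirst2 and testlast2 and the toy data are assumptions made for illustration; they are not from the thread.

```r
# Toy data (made-up values), already sorted by date within each ID.
id.d <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 3),
  datum = as.Date(c("2004-02-05", "2004-02-05", "2004-03-23",
                    "2005-04-21", "2006-01-30", "2006-01-30",
                    "2007-09-05"))
)

# nrow() counts rows; the nrow(x) > 1 guard returns FALSE (not NA)
# for IDs that have a single appointment.
testfirst2 <- function(x) nrow(x) > 1 && x[1, 2] == x[2, 2]
testlast2  <- function(x) nrow(x) > 1 && x[nrow(x), 2] == x[nrow(x) - 1, 2]

groups    <- split(id.d, id.d$ID)
first.dup <- sapply(groups, testfirst2)   # TRUE only for ID 1
last.dup  <- sapply(groups, testlast2)    # TRUE only for ID 2
```

Both functions still assume the rows are sorted by date within each ID.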
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [r] How to pick colums from a ragged array?

Rui Barradas
Hello,

Inline.
On 23-10-2012 14:53, Stuart Leask wrote:

> I too had a parsimonious solution that was also fooled by IDs that had a duplicate date that wasn't the first date, but was the same as another ID's duplicate+first.
>
> The right answer
> from this data:
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
>   c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>   ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>   ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>   ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>   ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>   ,20091224,20050503,19870508,19870508,19880330)
>
>   id.d <- cbind (ID,DATE )
>
> is:
>
> 167, 841  and 1019 - correct.
> 814 910 - incorrect. Although there are duplicate dates, they are not the first date.
>
> -----Original Message-----
> From: arun [mailto:[hidden email]]
> Sent: 23 October 2012 14:29
> To: Stuart Leask
> Cc: R help
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hi,
> Also one more thing:
> This should get the dates which are duplicated.  In my first reply, I was looking for the duplicated rows. Sorry for that!
>
> id.d<-data.frame(ID,DATE)
>
> new1<-id.d[duplicated(id.d$DATE)|duplicated(id.d$DATE,fromLast=TRUE),]
>
>
> new2<-new1[order(new1$ID,new1$DATE),]
>   tapply(new2$ID,new2$DATE,head,1)
> #19870508 20040205 20040429 20050421
>    #   910      167      814      841
>
> But the result is still not what you wanted, because 910's date is the earliest date when compared to 1019.
> new1[order(new1$ID,new1$DATE),]
> #     ID     DATE
> #5   167 20040205
> #6   167 20040205
> #18  814 20040429
> #19  814 20040429
> #22  841 20050421
> #23  841 20050421
> #31  910 19870508
> #32  910 20040205
> #33  910 20040205
> #38 1019 19870508
> #39 1019 19870508
>
> A.K.
>
> ----- Original Message -----
> From: Stuart Leask <[hidden email]>
> To: arun <[hidden email]>
> Cc: Petr PIKAL <[hidden email]>
> Sent: Tuesday, October 23, 2012 9:15 AM
> Subject: RE: [R] [r] How to pick colums from a ragged array?
>
> Sorry Arun, but when I run it I get an error:
>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> + ,1019)
>> DATE <-
> +  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> +  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> +  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> +  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> +  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
> +  ,20091224,20050503,19870508,19870508,19880330)
>>   id.d <- cbind (ID,DATE )
>> new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
>>
>>
>> tapply(new1$ID,new1$DATE,head,1)
> Error in new1$DATE : $ operator is invalid for atomic vectors

The error arises because id.d is a matrix, whereas Arun is using $, which
works only on lists and data frames. Try new1[, "ID"] and new1[, "DATE"]
instead.
In any case, I believe that solution returns the first row of every
duplicated date, not only the duplicates that fall on the first date of each ID.

Rui Barradas
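
To make the matrix versus data frame point concrete, here is a small sketch using three rows in the style of the test data. The variable names by.name.matrix, by.name.df and dollar.on.matrix are made up for the illustration.

```r
ID   <- c(167, 167, 814)
DATE <- c(20040205, 20040205, 20020814)

m  <- cbind(ID, DATE)        # numeric matrix with column names
df <- data.frame(ID, DATE)   # data frame with the same columns

by.name.matrix <- m[, "DATE"]   # matrix: index the column by name inside [ , ]
by.name.df     <- df$DATE       # data frame: $ works

# $ on a matrix raises an error; capture it here rather than stop the script.
dollar.on.matrix <- tryCatch(m$DATE, error = function(e) "error")
```

Building id.d with data.frame(ID, DATE) instead of cbind(ID, DATE) lets the $ form work unchanged.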

>
>
>
>
> -----Original Message-----
> From: arun [mailto:[hidden email]]
> Sent: 23 October 2012 14:05
> To: Stuart Leask
> Cc: R help; Petr PIKAL
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> HI,
> I was not following the thread.
> May be this is what you are looking for:
> new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
>
>
> tapply(new1$ID,new1$DATE,head,1)
> #19870508 20040205 20040429 20050421
>    #  1019      167      814      841
> A.K.
>
>
>
>
> ----- Original Message -----
> From: Stuart Leask <[hidden email]>
> To: PIKAL Petr <[hidden email]>; "[hidden email]" <[hidden email]>
> Cc:
> Sent: Tuesday, October 23, 2012 8:28 AM
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hi there.
>
> Not sure I follow what you are doing.
>
> I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.
>
> I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):
>
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
> ,20091224,20050503,19870508,19870508,19880330)
>
> Correct output:
> "167"  "841"  "1019"
>
> Stuart
>
> -----Original Message-----
> From: PIKAL Petr [mailto:[hidden email]]
> Sent: 23 October 2012 13:15
> To: Stuart Leask; [hidden email]
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> Rui's answer led me to a more elaborate solution, which still needs the data frame to be ordered by date:
>
> fff <- function(data, first=TRUE, remove=FALSE) {
>
> testfirst <- function(x) x[1,2]==x[2,2]
> # nrow(), not length(): length() of a data frame is its number of columns
> testlast <- function(x) nrow(x) > 1 && x[nrow(x),2]==x[nrow(x)-1,2]
>
> test <- if (first) testfirst else testlast
> sel <- as.numeric(names(which(sapply(split(data, data[,1]), test))))
>
> # %in% rather than == / != so that more than one selected ID is handled
> if (remove) data[!(data[,1] %in% sel),] else data[data[,1] %in% sel,] }
>
>
>> fff(id.d)
>      ID     DATE
> 31 910 20091105
> 32 910 20091105
> 33 910 20091117
> 34 910 20091119
> 35 910 20091120
> 36 910 20091210
> 37 910 20091224
> 38 910 20091224
>
>> fff(id.d, remove=T)
>       ID     DATE
> 1    58 20060821
> 2    58 20061207
> 3    58 20080102
> 4    58 20090904
> 5   167 20040205
> 6   167 20040323
> 7   323 20051111
> 8   323 20060111
> 9   323 20071119
> 10  323 20080107
> 11  323 20080407
> 12  323 20080521
> 13  323 20080711
> 14  547 20041005
> 15  794 20070905
> 16  814 20020814
> 17  814 20021125
> 18  814 20040429
> 19  814 20040429
> 20  814 20071205
> 21  814 20080227
> 22  841 20050421
> 23  841 20060130
> 24  841 20060428
> 25  841 20060602
> 26  841 20060816
> 27  841 20061025
> 28  841 20061129
> 29  841 20070112
> 30  841 20070514
> 39  999 20050503
> 40 1019 19870508
> 41 1019 19880223
> 42 1019 19880330
> 43 1019 19880330
> Regards
> Petr
>
>

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by Stuart Leask
Hi Stuart,

This also should get you the IDs you wanted.
new1<-id.d[duplicated(id.d[,2])|duplicated(id.d[,2],fromLast=TRUE),]
earliest <- tapply ( DATE, ID, min)               
 rownames(earliest[earliest%in% new1])
#[1] "167"  "841"  "1019"
A.K.
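
A possible alternative is to compare each date only with the earliest (or latest) date of its own ID, so that a date shared with a different ID cannot interfere. The sketch below is not code from the thread: ambiguous.ids is a hypothetical helper name, and the data is the refined test set Stuart posted in this thread, in which 910 has a duplicate that is not its first date and shares its earliest date with 1019.

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

# Hypothetical helper: IDs whose earliest (or latest) DATE occurs more than once.
ambiguous.ids <- function(id, date, first = TRUE) {
  pick <- if (first) min else max
  extreme <- ave(date, id, FUN = pick)             # each row gets its own ID's min/max
  n.at.extreme <- tapply(date == extreme, id, sum) # rows per ID at that date
  names(n.at.extreme)[n.at.extreme > 1]
}

first.amb <- ambiguous.ids(ID, DATE)                 # "167" "841" "1019"
last.amb  <- ambiguous.ids(ID, DATE, first = FALSE)  # "167"

# Drop the ambiguous patients before grouping by first diagnosis.
id.d <- data.frame(ID, DATE)
clean.first <- id.d[!(id.d$ID %in% as.numeric(first.amb)), ]
```

Because ave() and tapply() work on whole vectors, no prior sorting is needed, and the same helper covers the "last appointment" case through first = FALSE.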






----- Original Message -----
From: Stuart Leask <[hidden email]>
To: Rui Barradas <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Sent: Tuesday, October 23, 2012 7:37 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!)
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE as an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)               
rownames(earliest[earliest%in%nd.b]) 

This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
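
On the "why": ave() with FUN = length labels every row with the number of rows that share its ID and DATE, and %in% then tests each earliest date against every cell of nd.b, with IDs and dates pooled together, rather than against that ID's own rows. A small sketch with made-up values (not the thread's data) shows both steps:

```r
ID   <- c(1, 1, 1, 2, 2)
DATE <- c(20040205, 20040205, 20040323, 20050421, 20060130)
id.d <- cbind(ID, DATE)

# Number of rows sharing each row's (ID, DATE) pair.
how.many <- ave(id.d[, 1], id.d[, 1], id.d[, 2], FUN = length)

nd.b     <- id.d[how.many > 1, ]     # rows whose (ID, DATE) pair is repeated
earliest <- tapply(DATE, ID, min)    # earliest DATE per ID

# %in% treats the matrix nd.b as one flat vector of values.
hits <- earliest %in% nd.b
same <- identical(hits, earliest %in% as.vector(nd.b))

flagged <- names(earliest)[hits]
```

Here how.many is 2 2 1 1 1 and only ID 1 is flagged. The membership test carries no information about which ID a matching date came from.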


Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).


Stuart


************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...




******************************************************************



-----Original Message-----
From: Rui Barradas [mailto:[hidden email]]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: [hidden email]
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)


getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones


Hope this helps,

Rui Barradas
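
getRepeat() returns a list of data frames, one per affected ID. If only the IDs are wanted, one option is to take the first ID from each element. The sketch below repeats the function from the message above so that it runs on its own; the toy data and the names ids.first and ids.last are made up for illustration. Rows are assumed to be sorted by DATE within each ID, since the function relies on head() and tail() of each group.

```r
getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

# Toy data: ID 1 repeats its first date, ID 2 repeats its last date.
ID   <- c(1, 1, 1, 2, 2, 2)
DATE <- c(20040205, 20040205, 20040323, 20040205, 20050421, 20050421)
id.d <- cbind(ID, DATE)

# Pull one ID out of each returned data frame.
ids.first <- vapply(getRepeat(id.d), function(d) d$ID[1], numeric(1))
ids.last  <- vapply(getRepeat(id.d, first = FALSE), function(d) d$ID[1], numeric(1))
```

vapply() with numeric(1) also returns an empty numeric vector when no ID is affected.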


On 23-10-2012 10:59, Rui Barradas wrote:

> Hello,
>
> I'm not sure I understand it well, in the solution below the only
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing.
> The first is the vector to average (to apply FUN to) and the second is
> one of the two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas
> Em 23-10-2012 10:37, Stuart Leask escreveu:
>> I have a large dataset (~1 million rows) of three variables: ID
>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>> date).
>> Patients may have been assigned more than one diagnosis at any one
>> appointment - leading to two rows, same ID and DATE but different
>> DIAGNOSIS.
>> The diagnoses may change between appointments.
>>
>> I want to subset the data in two ways:
>>
>> -          define groups of patients by the first diagnosis given
>>
>> -          define groups of patients by the last diagnosis given.
>>
>> The problem:
>> Unfortunately, a small number of patients have been given more than
>> one diagnosis at their first (or last) appointment. These individuals
>> I need to identify and remove, as it's not possible to say uniquely
>> what their first (or last) diagnosis was. So I need to identify and
>> remove these individuals which have pairs of rows with the same ID
>> and (lowest or highest) DATE. The size of the dataset precludes the
>> option of doing this by eye.
>>
>> I suspect there is a very elegant way of doing this in R.
>>
>> This is what I've come up with:
>>
>>
>> -          Sort by DATE then ID
>>
>> -          Make a ragged array of DATE by ID
>>
>> -          Remove IDs that only occur once.
>>
>> -          Subtract the first and second DATEs. Remove IDs for which
>> this = zero, as this will only be true for IDs for which the
>> appointment is recorded twice (because there were two diagnoses
>> recorded on this date).
>>
>> -          (Then do the same to get the 'last appointment'
>> duplicates, by reversing the initial sort by DATE.)
>>
>> I am stuck at the 'Subtract dates' step: I would like to get the data
>> out of the ragged array by columns (so e.g. I end up with a matrix of
>> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
>> the ragged array.
>>
>> I hope someone can help. My ugly code is below, with some data for
>> testing.
>>
>>
>> Stuart
>>
>>
>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
>> and Honorary Consultant Pychiatrist Institute of Mental Health,
>> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
>> 115 82 30419
>> [hidden email]<mailto:[hidden email]>
>> Google 'Dr Stuart Leask'
>>
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <-
>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>> ,20091224,20050503,19870508,19880223,19880330)
>>
>> id.d <- cbind (ID,DATE )
>> rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
>> ragged array, 1-n DATES for every NAME
>>
>> # Inelegant attempt to remove IDs that only have one entry:
>>
>> rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
>> row # Since DATE is in 'year mo da', if there's only one date, sum
>> will be less than 2100000:
>> rag.t <- rag.s [ rag.s > 21000000 ]
>> multi.dates <- rownames ( rag.t )                         # all the
>> IDs with >1 date
>> rag.am <- rag.a [ multi.dates ]                           # rag.am
>> only has IDs with > 1 Date
>>
>>
>> # But now I'm stuck.
>> # Each row of the array is rag.am$ID.
>> # So I can't pick columns of DATEs from the ragged array.
>>

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by Stuart Leask


Hi,

# For each ID, test whether the second DATE repeats the first
# (assumes that dates are sorted within each ID)
res1 <- data.frame(col = sapply(tapply(DATE, ID, function(x) duplicated(head(x, 2))),
                                function(x) x[2]))
row.names(subset(res1, col == TRUE))
#[1] "167"  "841"  "1019"

A.K.
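
If the dates cannot be assumed to be sorted, one possible variant is to count how often each ID's minimum (or maximum) DATE occurs. This is only a sketch: the toy data and the names `first.ids` and `last.ids` are made up for illustration and are not from the thread.

```r
# Toy data (hypothetical), deliberately unsorted within ID
ID   <- c(1, 1, 1, 2, 2, 3, 3, 3)
DATE <- c(20040205, 20030101, 20030101, 20050421, 20060130,
          20070101, 20080101, 20080101)

# TRUE for each ID whose earliest DATE appears on more than one row
first.dup <- tapply(DATE, ID, function(x) sum(x == min(x)) > 1)
first.ids <- names(first.dup)[first.dup]

# Same idea for the latest DATE
last.dup <- tapply(DATE, ID, function(x) sum(x == max(x)) > 1)
last.ids <- names(last.dup)[last.dup]

first.ids  # IDs with a duplicated first appointment
last.ids   # IDs with a duplicated last appointment
```

Because min() and max() are computed per ID, the order of the rows does not matter here.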


----- Original Message -----
From: Stuart Leask <[hidden email]>
To: Rui Barradas <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Sent: Tuesday, October 23, 2012 8:17 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but its first date is the same as another ID's first date that IS a duplicate.
Test data is all - see it failing below.

So, I remain very grateful for your function!

Stuart


ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 21000000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.
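
The false positive for ID 910 arises because `earliest %in% nd.b` compares bare numbers against everything in the matrix `nd.b`, regardless of which ID a date belongs to. One possible repair, offered as a sketch and not as something posted in this thread, is to match on ID/DATE pairs. The names `dup.keys`, `early.keys` and `first.dup.ids` are mine.

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind(ID, DATE)

# All rows whose (ID, DATE) combination occurs more than once
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b <- id.d[how.many > 1, , drop = FALSE]

# Earliest DATE per ID
earliest <- tapply(DATE, ID, min)

# Build "ID DATE" keys so that a date can only match within its own ID
dup.keys   <- paste(nd.b[, "ID"], nd.b[, "DATE"])
early.keys <- paste(names(earliest), earliest)

first.dup.ids <- names(earliest)[early.keys %in% dup.keys]
first.dup.ids  # 910 should no longer be reported
```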


-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: [hidden email]
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!). I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE as an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)               
rownames(earliest[earliest%in%nd.b]) 

This seems to work - and it does seem to exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!


Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).


Stuart


************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...




******************************************************************



-----Original Message-----
From: Rui Barradas [mailto:[hidden email]]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: [hidden email]
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)


getRepeat <- function(x, first = TRUE){
     # head() picks the first DATE per ID, tail() the last (assumes DATE is sorted within ID)
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     # for each ID, flag the rows whose DATE equals that first/last DATE
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     # keep only the IDs where that DATE occurs more than once
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones
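
getRepeat() returns a list with one data frame per offending ID. A possible way to collapse that list into a plain vector of IDs is sketched here, reusing the function and the data from this message. The names `first.ids` and `last.ids` are mine, and the sketch assumes at least one ID is returned.

```r
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

# Function as posted in the message
getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

# Stack the per-ID data frames and keep each ID once
first.ids <- unique(do.call(rbind, getRepeat(id.d))$ID)
last.ids  <- unique(do.call(rbind, getRepeat(id.d, first = FALSE))$ID)

# Rows of patients whose first appointment is unambiguous
id.d[!(id.d[, "ID"] %in% first.ids), ]
```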


Hope this helps,

Rui Barradas


On 23-10-2012 10:59, Rui Barradas wrote:

> Hello,
>
> I'm not sure I understand it well, in the solution below the only
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing.
> The first is the vector to average (to apply FUN to) and the second is
> one of the two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas
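
To make the role of the repeated argument in the ave() call concrete, here is a small illustration on made-up data (the toy vectors are hypothetical, not from the thread). The first argument only supplies the values handed to FUN; the second and third define the groups, so FUN = length counts the rows in each ID/DATE combination.

```r
# Toy data (hypothetical)
ID   <- c(1, 1, 1, 2, 2)
DATE <- c(20030101, 20030101, 20040205, 20030101, 20050421)

# Number of rows sharing each (ID, DATE) pair, returned once per row
how.many <- ave(ID, ID, DATE, FUN = length)
how.many

# Rows that are part of a repeated (ID, DATE) pair
cbind(ID, DATE)[how.many > 1, , drop = FALSE]
```

Note that ID 2 also has DATE 20030101, but it is counted separately because the groups are defined by ID and DATE together.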

Re: [r] How to pick colums from a ragged array?

arun kirshna
In reply to this post by Stuart Leask
Hi,
I read your first post and I think I understand what you meant.
I guess this should work.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )
# Flag IDs whose first or last DATE is duplicated (assumes DATE is sorted within each ID)
flag <- tapply(id.d[,2], id.d[,1], FUN=function(x) {
    dup <- duplicated(x) | duplicated(x, fromLast=TRUE)   # TRUE for every repeated DATE
    head(dup, 1) | tail(dup, 1)
})
id.d[id.d[,1] %in% names(flag)[flag], ]
 #      ID     DATE
 #[1,]  167 20040205
 #[2,]  167 20040205
 #[3,]  841 20050421
 #[4,]  841 20050421
 #[5,]  841 20060428
 #[6,]  841 20060602
 #[7,]  841 20060816
 #[8,]  841 20061025
 #[9,]  841 20061129
#[10,]  841 20070112
#[11,]  841 20070514
#[12,] 1019 19870508
#[13,] 1019 19870508
#[14,] 1019 19880330

So, these IDs should be deleted if I am correct.
A.K.
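
For the original goal (groups of patients by first diagnosis, with the ambiguous patients dropped), a rough sketch might look like the following. The data frame `dat` and its DIAGNOSIS codes are invented for illustration, since no DIAGNOSIS column was posted; the same pattern with max instead of min would give the last diagnosis.

```r
# Hypothetical toy data; the DIAGNOSIS codes are made up
dat <- data.frame(
  ID        = c(1, 1, 1, 2, 2, 3, 3, 3),
  DATE      = c(20030101, 20030101, 20040205, 20050421, 20060130,
                20070101, 20080101, 20080101),
  DIAGNOSIS = c("A", "B", "A", "C", "A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Rows that fall on each patient's earliest DATE
first.rows <- dat[dat$DATE == ave(dat$DATE, dat$ID, FUN = min), ]

# Patients with more than one row on that earliest DATE are ambiguous
ambiguous <- unique(first.rows$ID[duplicated(first.rows$ID)])

# One row per remaining patient: their unique first diagnosis
first.dx <- first.rows[!(first.rows$ID %in% ambiguous), c("ID", "DIAGNOSIS")]

# Patient IDs grouped by first diagnosis
split(first.dx$ID, first.dx$DIAGNOSIS)
```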




