

I have a large dataset (~50,000 rows, 96 columns) of hospital
administrative data.
Many of the columns are clinical codings of inpatient events (using ICD-10).
A simplified example of the data is below:
> dput(dat_unmatched)
structure(list(ID = structure(c(4L, 3L, 2L, 1L), .Label = c("BCM3455",
"BZD2643", "GDR2343", "MCZ4325"), class = "factor"), X.1 = structure(c(2L,
3L, 1L, 1L), .Label = c("B83.2", "C23.2", "F56.23"), class = "factor"),
X.2 = structure(c(2L, 1L, 2L, 2L), .Label = c("M20.64", "T43.2"
), class = "factor"), X.3 = structure(c(2L, 3L, 3L, 1L), .Label =
c("F56.23",
"R23.1", "Y32.1"), class = "factor"), X.4 = structure(c(1L,
2L, 2L, 3L), .Label = c("M23.5", "T44.2", "Y32.1"), class = "factor"),
X.5 = structure(c(1L, 2L, 1L, 2L), .Label = c("", "Q23.6"
), class = "factor")), .Names = c("ID", "X.1", "X.2", "X.3",
"X.4", "X.5"), class = "data.frame", row.names = c(NA, 4L))
I am interested in the set of codes that start with a "T" or a "Y", and I
would like to link each of them to the preceding column whose code does not
begin with a "T" or "Y". I suspect I will need to use regular expressions,
and likely a loop, but I am really out of my depth at this point.
I would like the final dataset to look like:
> dput(dat_matched)
structure(list(ID = structure(c(4L, 3L, 2L, 1L), .Label = c("BCM3455",
"BZD2643", "GDR2343", "MCZ4325"), class = "factor"), X.1 = structure(c(2L,
3L, 1L, 1L), .Label = c("B83.2", "C23.2", "M20.64"), class = "factor"),
X.2 = structure(c(1L, 2L, 1L, 1L), .Label = c("T43.2", "Y32.1"
), class = "factor"), X.3 = structure(c(1L, 4L, 2L, 3L), .Label = c("",
"B83.2", "F56.23", "M20.64"), class = "factor"), X.4 = structure(c(1L,
2L, 3L, 3L), .Label = c("", "T44.2", "Y32.1"), class = "factor"),
X.5 = structure(c(1L, 1L, 2L, 1L), .Label = c("", "B83.2"
), class = "factor"), X = structure(c(1L, 1L, 2L, 1L), .Label = c("",
"T44.2"), class = "factor")), .Names = c("ID", "X.1", "X.2",
"X.3", "X.4", "X.5", "X"), class = "data.frame", row.names = c(NA,
4L))
Any help appreciated.
Matthew
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


"Out of your depth" does not serve as a legitimate excuse, for me
anyway. There are many good tutorials on regular expressions out
there. Go through one. Ditto with R data handling. "An Introduction to
R" (ships with R) is one that's right at hand.
Although others may be more inclined than I am to help, you would
certainly increase the likelihood by first doing some homework and
showing us code that you tried. Although, by that time, you probably
will have figured it out for yourself.
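As a concrete starting point for the regular-expression part, a first attempt could be as small as the sketch below. It assumes the codes have already been converted from factor to character; the pattern "^[TY]" matches any string whose first character is T or Y. The object names codes and is_ty are made up for illustration.

```r
# One row of the example data (ID GDR2343), as a plain character vector.
codes <- c("F56.23", "M20.64", "Y32.1", "T44.2", "Q23.6")

# "^" anchors the match at the start of the string; "[TY]" matches T or Y.
is_ty <- grepl("^[TY]", codes)

codes[is_ty]    # the codes that start with T or Y
which(!is_ty)   # positions of the remaining codes
```

The same logical vector can then be used to decide, for each T/Y code, which earlier column it should be linked to.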
Cheers,
Bert
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 4677374
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll
On Wed, Dec 17, 2014 at 12:14 PM, Robert Strother < [hidden email]> wrote:


Not sure how much help it will be, but there is a package on CRAN called
icd9. Although the codes are clearly different in ICD-10, it may give you
some hints. I suppose you could even email the maintainer to see whether
there is an icd10 in the pipeline.

Michael
http://www.dewey.myzen.co.uk


I do not think that you need regular expressions for your problem.
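Please see below:
> d0 <- dat_unmatched
> tmp <- apply(d0, 1, function(x){
+   # first character of every field in the row (ID included)
+   first <- substr(x,1,1)
+   # positions where the first letter equals "T" or "Y"
+   # (element-wise comparison; c("T", "Y") is recycled along the row)
+   idx <- which(c("T", "Y") == first)
+   # pair the field just before the first hit with every hit
+   comb <- paste(x[idx[1]-1], x[idx], collapse=" ")
+   unlist(strsplit(comb, " "))
+ })
> names(tmp) <- d0$ID
> tmp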
> tmp
$MCZ4325
[1] "C23.2" "T43.2"
$GDR2343
[1] "M20.64" "Y32.1" "M20.64" "T44.2"
$BZD2643
[1] "B83.2" "T43.2" "B83.2" "Y32.1" "B83.2" "T44.2"
$BCM3455
[1] "B83.2" "T43.2"
Is this what you are looking for? I hope this helps.
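One caveat with the comparison c("T", "Y") == first: it is element-wise with recycling, so it only finds a "T" in odd positions and a "Y" in even positions of the row. In the output above this is why Y32.1 is missing for BCM3455, where dat_matched pairs it with F56.23. The sketch below is one possible alternative, not a tested solution for the full data. It assumes missing codes are stored as empty strings, rebuilds the example as character columns, and uses made-up names (dat, pair_ty, wide). Each T/Y code is linked to the most recent earlier code that is neither T/Y nor empty.

```r
# The example data from the question, rebuilt with character columns.
dat <- data.frame(
  ID  = c("MCZ4325", "GDR2343", "BZD2643", "BCM3455"),
  X.1 = c("C23.2", "F56.23", "B83.2", "B83.2"),
  X.2 = c("T43.2", "M20.64", "T43.2", "T43.2"),
  X.3 = c("R23.1", "Y32.1", "Y32.1", "F56.23"),
  X.4 = c("M23.5", "T44.2", "T44.2", "Y32.1"),
  X.5 = c("", "Q23.6", "", "Q23.6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one row of codes, return base/T-or-Y pairs.
pair_ty <- function(codes) {
  is_ty   <- grepl("^[TY]", codes)
  is_base <- !is_ty & !is.na(codes) & nzchar(codes)
  # Index of the latest base code at or before each position (0 if none yet).
  last_base <- cummax(ifelse(is_base, seq_along(codes), 0L))
  keep <- is_ty & last_base > 0
  # Interleave: base code, then the T/Y code linked to it.
  as.vector(rbind(codes[last_base[keep]], codes[keep]))
}

# as.matrix() also turns factor columns into character, as in the real data.
m <- as.matrix(dat[setdiff(names(dat), "ID")])
pairs <- lapply(seq_len(nrow(m)), function(i) pair_ty(unname(m[i, ])))
names(pairs) <- dat$ID

# Pad every row with "" to a common width and assemble a wide data frame.
width  <- max(lengths(pairs))
padded <- t(vapply(pairs, function(p) c(p, rep("", width - length(p))),
                   character(width)))
colnames(padded) <- paste0("X.", seq_len(width))
wide <- data.frame(ID = dat$ID, padded, stringsAsFactors = FALSE,
                   row.names = NULL)
```

On the four example rows this reproduces the values in dat_matched, except that the last column is named X.6 rather than X. The padding step assumes at least one row contains a T/Y code.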
Chel Hee Lee
On 12/18/2014 07:41 AM, Michael Dewey wrote:

