|
Hi,
First of all, I would like to introduce myself as I will probably have many
questions over the next few weeks and want to thank you guys in advance for
your help. I'm a cancer researcher and I need to learn R to complete a few
projects. I have an introductory background in Python.

My questions at the moment are based on the following sample input file:

*Sample_Input_File*

    characteristics_ch1.3
    Stage: T1N0
    Stage: T2N1
    Stage: T0N0
    Stage: T1N0
    Stage: T0N3

"characteristics_ch1.3" is a column header in the input excel file.

"T's" represent stage and "N's" represent degree of disease spreading.

I want to create output that looks like this:

*Sample_Output_File*

    T N
    1 0
    2 1
    0 0
    1 0
    0 3

As it currently stands, my code is the following:

    rm(list=ls())

    source("../../functions.R")

    uncurated <- read.csv("../uncurated/Sample_Input_File_full_pdata.csv",
                          as.is=TRUE, row.names=1)

    ##initial creation of curated dataframe
    curated <- initialCuratedDF(rownames(uncurated),
                                template.filename="Sample_Template_File.csv")

    ##--------------------
    ##start the mappings
    ##--------------------

    ##title -> alt_sample_name
    curated$alt_sample_name <- uncurated$title

    #T
    tmp <- uncurated$characteristics_ch1.3
    tmp <- *??????*
    curated$T <- tmp

    #N
    tmp <- uncurated$characteristics_ch1.3
    tmp <- *??????*
    curated$N <- tmp

    write.table(curated, row.names=FALSE,
                file="../curated/Sample_Output_File_curated_pdata.txt", sep="\t")

My question is the following:

What code gets me the desired output (replacing the *??????*'s above)? I
want to: a) find the integer value one element to the right of "T"; and b)
find the integer value one element to the right of "N". I've read the
regular expression tutorial for R, but could only figure out how to grab an
integer value if it is the only integer value in the row (i.e. more than one
integer value makes this basic regular expression unsuccessful).

Thank you very much for any help you can provide.

Sincerely,

Ben Ganzfried

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
|
|
On Jun 2, 2011, at 2:54 PM, Ben Ganzfried wrote:

> Hi,
>
> First of all, I would like to introduce myself as I will probably have many
> questions over the next few weeks and want to thank you guys in advance for
> your help. I'm a cancer researcher and I need to learn R to complete a few
> projects. I have an introductory background in Python.
>
> My questions at the moment are based on the following sample input file:
> *Sample_Input_File*
> characteristics_ch1.3 Stage: T1N0 Stage: T2N1 Stage: T0N0 Stage: T1N0
> Stage: T0N3

I haven't quite figured out what your structure really is, and for that you
should learn to post the output of dput() on the R object... but see if
this helps:

    > stg <- c('Stage: T1N0', 'Stage: T2N1', 'Stage: T0N0', 'Stage: T1N0',
    +          'Stage: T0N3')
    > Tstg <- sub(".*T(\\d)N.", "\\1", stg)
    > Tstg
    #[1] "1" "2" "0" "1" "0"
    > Nstg <- sub(".*T\\dN(\\d)", "\\1", stg)
    > Nstg
    #[1] "0" "1" "0" "0" "3"

> "characteristics_ch1.3" is a column header in the input excel file.
>
> "T's" represent stage and "N's" represent degree of disease spreading.
>
> I want to create output that looks like this:
> *Sample_Output_File*
> T N
> 1 0
> 2 1
> 0 0
> 1 0
> 0 3
>
> As it currently stands, my code is the following:
>
> # rm(list=ls())

####----
AND PLEASE DON'T POST THAT CODE WITHOUT A COMMENT.

I noticed it this time, but it is very aggravating to accidentally wipe out
hours of work while trying to offer help.

> source("../../functions.R")
>
> uncurated <- read.csv("../uncurated/Sample_Input_File_full_pdata.csv",
>                       as.is=TRUE, row.names=1)
>
> ##initial creation of curated dataframe
> curated <- initialCuratedDF(rownames(uncurated),
>                             template.filename="Sample_Template_File.csv")
>
> ##--------------------
> ##start the mappings
> ##--------------------
>
> ##title -> alt_sample_name
> curated$alt_sample_name <- uncurated$title
>
> #T
> tmp <- uncurated$characteristics_ch1.3
> tmp <- *??????*
> curated$T <- tmp

So here Tstg is tmp

> #N
> tmp <- uncurated$characteristics_ch1.3
> tmp <- *??????*
> curated$N <- tmp

And Nstg is tmp

> write.table(curated, row.names=FALSE,
>             file="../curated/Sample_Output_File_curated_pdata.txt", sep="\t")
>
> My question is the following:
>
> What code gets me the desired output (replacing the *??????*'s above)? I
> want to: a) Find the integer value one element to the right of "T"; and b)
> find the integer value one element to the right of "N". I've read the
> regular expression tutorial for R, but could only figure out how to grab an
> integer value if it is the only integer value in the row (ie more than one
> integer value makes this basic regular expression unsuccessful).

Just surround it with a pattern and use the "()" capture and "\\n"
back-reference mechanism (n being the number of the parenthesized group).

> Thank you very much for any help you can provide.
>
> Sincerely,
>
> Ben Ganzfried

David Winsemius, MD
West Hartford, CT
|
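For anyone following along, here is a minimal sketch of how the two sub() calls could slot into the T and N mappings of the original script. The data frame below is a made-up stand-in for `uncurated` (the real one comes from read.csv), `curated` is built directly here instead of via initialCuratedDF(), and the as.integer() conversion and the trailing `.*` in each pattern are additions for illustration and not part of the reply above.

```r
# Hypothetical stand-in for the poster's 'uncurated' data frame
uncurated <- data.frame(
  characteristics_ch1.3 = c("Stage: T1N0", "Stage: T2N1", "Stage: T0N0",
                            "Stage: T1N0", "Stage: T0N3"),
  stringsAsFactors = FALSE
)

stg <- uncurated$characteristics_ch1.3

# Each sub() replaces the whole string with the single captured digit;
# as.integer() then turns the character result into a number.
curated <- data.frame(
  T = as.integer(sub(".*T(\\d)N.*", "\\1", stg)),
  N = as.integer(sub(".*T\\dN(\\d).*", "\\1", stg))
)

print(curated)
```

Note that sub() returns its input unchanged when the pattern does not match, so a row with an unexpected format would come out of as.integer() as NA (with a warning) instead of stopping the script.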
|
Thank you very much for your help. It saved me a lot of time and it worked
perfectly. I have a quick follow-up as I'm not sure I understand yet why the
code works and where it comes from.

For example, in:

    Tstg <- sub(".*T(\\d)N.", "\\1", tmp)

How exactly does the substitution operation work?

On a high level, I get that we are taking the values in the vector tmp and
replacing each tmp value with the integer immediately after the "T". But at
a lower level, how does ".*T(\\d)N.", "\\1" actually get us there? I'll
undoubtedly face similar but different situations many times in the future
and I want to make sure that I know how to solve them.

Thanks again--I really appreciate your kindness.

Ben Ganzfried
|
|
On Jun 2, 2011, at 4:21 PM, Ben Ganzfried wrote:

> Thank you very much for your help. It saved me a lot of time and it
> worked perfectly. I have a quick follow-up as I'm not sure I
> understand yet why the code works and where it comes from.
>
> For example, in: Tstg <- sub(".*T(\\d)N.", "\\1", tmp)
>
> How exactly does the substitution operation work?
>
> On a high-level, I get that we are taking the values in the vector
> tmp, and replacing each tmp value with the integer immediately after
> the "T". But more lower-level, how does ".*T(\\d)N.", "\\1"
> actually get us there? I'll undoubtedly face similar but different
> situations many times in the future and I want to make sure that I
> know how to solve them.

The parentheses in the first pattern enclose the portion that can be
referred to with "\\1" in the second argument. Since I only enclosed the
\\d (which is a single digit), that's what got "substituted" for the
entire matched pattern. The initial part of the pattern was <dot><star> ==
".*" which will match anything (or nothing) before the "T", but since it
wasn't in the parens, it gets dropped. The trailing "N." (the N plus one
more character) is matched and dropped in the same way. It's actually all
described on the regex page (see ?regex), but it helps to work through some
examples to get the hang of it.

--
David.

> Thanks again--I really appreciate your kindness.
>
> Ben Ganzfried

David Winsemius, MD
West Hartford, CT
|
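To work through an example of the mechanism described above, the sketch below takes one of the sample strings and shows what the whole pattern matches, what each parenthesized group captures, and what different replacement strings produce. The two-group pattern and the variable names are illustrative choices, not code from the thread; regexec() and regmatches() are base R functions.

```r
x <- "Stage: T2N1"

# regexec() records the whole match and every parenthesized group;
# regmatches() pulls them out as a character vector.
m <- regmatches(x, regexec(".*T(\\d)N(\\d)", x))[[1]]
print(m)   # element 1: whole match, element 2: group 1, element 3: group 2

# The replacement string stands in for the WHOLE match, so only the
# groups it mentions survive.
only_t  <- sub(".*T(\\d)N(\\d)", "\\1", x)          # keep group 1
only_n  <- sub(".*T(\\d)N(\\d)", "\\2", x)          # keep group 2
rebuilt <- sub(".*T(\\d)N(\\d)", "T=\\1, N=\\2", x) # use both groups

# Without the leading ".*", the text before "T" is not part of the
# match and is therefore left in place.
no_prefix <- sub("T(\\d)N.", "\\1", x)

# If nothing matches, sub() hands back its input untouched.
unmatched <- sub(".*T(\\d)N.", "\\1", "Stage: unknown")

print(c(only_t, only_n, rebuilt, no_prefix, unmatched))
```

The `no_prefix` and `unmatched` cases are the two that tend to surprise: anything outside the matched span is kept, and a failed match is silent.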
|
In reply to this post by Ben Ganzfried
Here is one way you might do it.
    > con <- textConnection("
    + characteristics_ch1.3 Stage: T1N0 Stage: T2N1
    + Stage: T0N0 Stage: T1N0 Stage: T0N3
    + ")
    > txt <- scan(con, what = "")
    Read 11 items
    > close(con)
    >
    > Ts <- grep("^T", txt, value = TRUE)
    > Ts <- sub("T([[:digit:]]+)N([[:digit:]]+)", "\\1x\\2", Ts)
    > out <- do.call(rbind, strsplit(Ts, "x"))
    > mode(out) <- "numeric"
    > dimnames(out) <- list(rep("", nrow(out)), c("T", "N"))
    >
    > out
     T N
     1 0
     2 1
     0 0
     1 0
     0 3
    >

Now you can print 'out' however you want it, e.g.

    > sink("outfile.txt")
    > out
    > sink()

This is slightly more complex than it might be as I have allowed for the
possibility that your numbers have more than one digit.

Bill Venables.

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Ben Ganzfried
Sent: Friday, 3 June 2011 4:54 AM
To: [hidden email]
Subject: [R] Matrix Question
|
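A variation on the multi-digit idea, offered only as a sketch: both numbers can be captured in one pass with regexec() and regmatches(), with NA filled in for rows that do not match. The inputs "Stage: T10N3" and "Stage: unknown" are invented here to exercise the multi-digit and no-match cases; they do not appear in the original data.

```r
# Sample values plus two invented edge cases (multi-digit, no match)
stg <- c("Stage: T1N0", "Stage: T2N1", "Stage: T10N3", "Stage: unknown")

pat <- "T([[:digit:]]+)N([[:digit:]]+)"

# One list element per input: whole match + two groups, or character(0)
m <- regmatches(stg, regexec(pat, stg))

# Turn each element into a pair of integers, using NA when nothing matched
pairs <- vapply(
  m,
  function(g) if (length(g) == 3L) as.integer(g[2:3]) else c(NA_integer_, NA_integer_),
  integer(2)
)

out <- t(pairs)                 # one row per input string
colnames(out) <- c("T", "N")
print(out)
```

Keeping the unmatched rows as NA preserves the row count, which matters if the result is assigned back into a data frame column by column.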
