Hi R users,
I really need help with subsetting data frames:

I have a large database of medical records and I want to be able to match
patterns from a list of search terms.

I've used this simplified data frame in a previous example:

db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
"data.frame", row.names = c(NA,
-4L))

terms_include <- c("1","2","3")
terms_exclude <- c("1.1","1.2","1.3")

So in this example I want to include all the terms from terms_include as
long as they don't occur with terms_exclude in the same row of the data
frame.

Previously I was given this function which works very well if you want to
match exactly:

f <- function(x) !any(x %in% terms_exclude) && any(x %in% terms_include)
db[apply(db[, -1], 1, f), ]

   ind test1 test2 test3
2 ind2     2    27  28.0
4 ind4     3     2   1.2

I would like to know if there is a way to write a similar function that
looks for matches that start with the query string, as in
grepl("^pattern",x).

I started writing a function but am not sure how to get it to return the
dataframe or matrix:

for (i in 1:length(terms_include)){
db_new <- apply(db,2, grepl,pattern=i)
}

Applying this function gives me:

db_new <- structure(c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), .Dim = c(4L,
4L), .Dimnames = list(NULL, c("ind", "test1", "test2", "test3"
)))

So the above is searching the pattern anywhere in the dataframe instead of
just at the beginning of the string.

How would I incorporate look for terms to include but don't return the row
of the data frame if it also includes one of the terms to exclude while
using partial matching?

I hope that this makes sense.

Many thanks,
Natalie

-----
Natalie Van Zuydam

PhD Student
University of Dundee
nvanzuydam@dundee.ac.uk
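
A note on the loop in the question: pattern=i passes the loop index (1, 2, 3) rather than a search term, the pattern carries no "^" anchor, and db_new is overwritten on every pass, so the matrix shown is just the unanchored result for the last pattern, "3". What follows is a minimal sketch of one way to get an anchored logical matrix in a single step. It assumes the ind column should be skipped and that each value should be compared in its as.character() form; the name starts_with_term is only illustrative and does not come from the thread.

```r
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")

# One anchored pattern covering every term: "^(1|2|3)"
pat <- paste0("^(", paste(terms_include, collapse = "|"), ")")

# Convert each column with as.character() before matching. apply() on a
# mixed data frame goes through as.matrix(), which pads numbers with
# format() (2 becomes " 2"), and a padded value never matches "^2".
starts_with_term <- sapply(db[, -1], function(col) grepl(pat, as.character(col)))
starts_with_term
```

The result is a logical matrix with one row per record and one column per test, so rowSums(starts_with_term) > 0 marks the rows where any value starts with an included term.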
Does this do what you want?

> db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
+ 2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
+ 9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
+ "data.frame", row.names = c(NA,
+ -4L))
>
> terms_include <- c("1","2","3")
> terms_exclude <- c("1.1","1.2","1.3")
>
> f.match <- function(obj, inc, exc){
+     pat <- paste("^(", paste(inc, collapse = "|"), ")", sep = '')
+     patex <- paste(exc, collapse = "|")
+     isMatch <- apply(obj, 1, function(x) any(grepl(pat, x)))
+     notMatch <- !apply(obj, 1, function(x) any(grepl(patex, x)))
+     obj[isMatch & notMatch,]
+ }
>
> db
   ind test1 test2 test3
1 ind1   1.0    56   1.1
2 ind2   2.0    27  28.0
3 ind3   1.3    58   9.0
4 ind4   3.0     2   1.2
> f.match(db, terms_include, terms_exclude)
   ind test1 test2 test3
2 ind2     2    27    28
>

On Mon, Dec 5, 2011 at 6:32 AM, natalie.vanzuydam <[hidden email]> wrote:
> --
> View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-tp4160127p4160127.html
> Sent from the R help mailing list archive at Nabble.com.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

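
Two details of f.match may matter on real records. The exclusion pattern is not anchored and its "." is a regex wildcard, so "1.1" would also exclude a value such as 121. In addition, apply() converts the data frame with as.matrix(), which pads numbers with format(), so a value like 2 in a column that also holds 56 becomes " 2" and no longer matches "^2". The sketch below is one possible variant, not the method from the reply: it assumes exclusions are also meant as literal prefix matches (the thread does not say), that the first column is an identifier to skip, and that R >= 3.3.0 is available for startsWith(). The names has_prefix, f_match_prefix and id_col are hypothetical.

```r
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")
terms_exclude <- c("1.1", "1.2", "1.3")

# Hypothetical helper: TRUE for each value that begins with at least one
# of the literal terms. startsWith() does no regex interpretation, so "."
# only matches a dot.
has_prefix <- function(values, terms) {
  Reduce(`|`,
         lapply(terms, function(term) startsWith(values, term)),
         rep(FALSE, length(values)))
}

# Hypothetical variant of f.match: keep rows where some value starts with
# an included term and no value starts with an excluded term.
f_match_prefix <- function(obj, inc, exc, id_col = 1) {
  # Character matrix built column by column, so no format() padding
  vals <- do.call(cbind, lapply(obj[, -id_col, drop = FALSE], as.character))
  any_hit <- function(terms) {
    m <- has_prefix(as.vector(vals), terms)
    dim(m) <- dim(vals)          # restore rows x columns layout
    rowSums(m) > 0
  }
  obj[any_hit(inc) & !any_hit(exc), ]
}

res <- f_match_prefix(db, terms_include, terms_exclude)
res
```

On the example data this keeps the same single row (ind2) as f.match. Whether prefix or exact matching is right for the exclusion list depends on how the medical codes are structured.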
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.