Subsetting a data frame


natalie.vanzuydam
Hi R users,

I really need help with subsetting data frames:

I have a large database of medical records, and I want to match patterns from a list of search terms.

I've used this simplified data frame in a previous example:


db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class = "data.frame", row.names = c(NA,
-4L))

terms_include <- c("1","2","3")
terms_exclude <- c("1.1","1.2","1.3")


So in this example I want to keep every row that contains a term from terms_include, as long as no term from terms_exclude occurs in the same row of the data frame.

Previously I was given this function, which works very well for exact matches:


f <- function(x)  !any(x %in% terms_exclude) && any(x %in% terms_include)
db[apply(db[, -1], 1, f), ]

   ind test1 test2 test3
2 ind2     2    27  28.0
4 ind4     3     2   1.2


I would like to know if there is a way to write a similar function that looks for values that start with the query string, as in grepl("^pattern", x).

I started writing a function but am not sure how to get it to return the data frame or matrix:


for (i in 1:length(terms_include)){
db_new <- apply(db,2, grepl,pattern=i)
}

Running this loop gives me:

db_new <- structure(c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), .Dim = c(4L,
4L), .Dimnames = list(NULL, c("ind", "test1", "test2", "test3"
)))

So the above searches for the pattern anywhere in each value instead of only at the beginning of the string.

How can I select the rows that contain one of the terms to include, but drop any row that also contains one of the terms to exclude, while using partial matching?

I hope that this makes sense.

Many thanks,
Natalie
Natalie Van Zuydam

PhD Student
University of Dundee
nvanzuydam@dundee.ac.uk

Re: Subsetting a data frame

jholtman
Does this do what you want?

> db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
+ 2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
+ 9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
+ "data.frame", row.names = c(NA,
+ -4L))
>
> terms_include <- c("1","2","3")
> terms_exclude <- c("1.1","1.2","1.3")
>
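> f.match <- function(obj, inc, exc){
+     # anchored include pattern, e.g. "^(1|2|3)": a value must start with a term
+     pat <- paste("^(", paste(inc, collapse = "|"), ")", sep = '')
+     # exclude pattern, e.g. "1.1|1.2|1.3": unanchored, and "." matches any character
+     patex <- paste(exc, collapse = "|")
+     # apply() over rows converts each row to character before grepl() sees it
+     isMatch <- apply(obj, 1, function(x) any(grepl(pat, x)))
+     notMatch <- !apply(obj, 1, function(x) any(grepl(patex, x)))
+     # keep rows with at least one include hit and no exclude hit
+     obj[isMatch & notMatch,]
+ }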
>
> db
   ind test1 test2 test3
1 ind1   1.0    56   1.1
2 ind2   2.0    27  28.0
3 ind3   1.3    58   9.0
4 ind4   3.0     2   1.2
> f.match(db, terms_include, terms_exclude)
   ind test1 test2 test3
2 ind2     2    27    28
>
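
A few details of the function above may matter on real data. apply() converts the data frame with as.matrix(), which formats numeric columns to a common width (so 2 can become " 2" and no longer match "^2"). The "." in the exclude terms is treated as a regex wildcard. The ind column is searched along with the test columns. The following is only a sketch of one way around these, not a definitive implementation. It assumes R >= 3.3.0 for startsWith(), assumes the exclude terms should also be matched as literal prefixes, and uses made-up helper names (starts_with_any, f_prefix, id_cols):

```r
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")
terms_exclude <- c("1.1", "1.2", "1.3")

# Hypothetical helper: TRUE where an element of x starts with any of the
# prefixes. startsWith() compares literally, so "." is not a wildcard.
starts_with_any <- function(x, prefixes) {
  Reduce(`|`,
         lapply(prefixes, function(p) startsWith(x, p)),
         rep(FALSE, length(x)))
}

# Hypothetical variant of f.match: id_cols gives the positions of the
# columns that should not be searched (assumed to be the first column).
f_prefix <- function(obj, inc, exc, id_cols = 1) {
  # convert column by column, so no padding is introduced
  vals <- lapply(obj[-id_cols], as.character)
  # per-row flags: does any value column start with an include/exclude term?
  inc_hit <- Reduce(`|`, lapply(vals, starts_with_any, prefixes = inc))
  exc_hit <- Reduce(`|`, lapply(vals, starts_with_any, prefixes = exc))
  obj[inc_hit & !exc_hit, , drop = FALSE]
}

res <- f_prefix(db, terms_include, terms_exclude)
```

On the example data, res should contain only the ind2 row, the same row that f.match() returns above.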

On Mon, Dec 5, 2011 at 6:32 AM, natalie.vanzuydam <[hidden email]> wrote:

> [...]



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.