|
Hi all,
My code looks like the following:

inname = read.csv("ID_error_checker.csv", as.is=TRUE)
outname = read.csv("output.csv", as.is=TRUE)

#My algorithm is the following:
#for line in inname
#if first string up to whitespace in row in inname$name = first string up to whitespace in row + 1 in inname$name
#AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row below it
#copy these two lines to a new file

In other words, if the name (up to the first whitespace) in the first row equals the name in the second row (etc for whole file) and the ID in the first row does not equal the ID in the second row, copy both of these rows in full to a new file. Only caveat is that I want a regular expression not to take the full names, but just the first string up to the first whitespace in the inname$name column (ie if row1 has a name of: New York Mets and row2 has a name of New York Yankees, I would want both of these rows to be copied in full since "New" is the same in both...)

Here is some example data:

ID  NAME                  YEAR  SOURCE       NOTES
1   New York Mets         1900  ESPN
2   New York Yankees      1920  Cooperstown
3   Boston Redsox         1918  ESPN
4   Washington Nationals  2010  ESPN
5   Detroit Tigers        1990  ESPN

The desired output would be:

ID  NAME                  YEAR  SOURCE
1   New York Mets         1900  ESPN
2   New York Yankees      1920  Cooperstown

Thanks so much!

[[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
|
Hi,

Try this:

dat1<-read.table(text="
ID, NAME, YEAR, SOURCE
1, New York Mets, 1900, ESPN
2, New York Yankees, 1920, Cooperstown
3, Boston Redsox, 1918, ESPN
4, Washington Nationals, 2010, ESPN
5, Detroit Tigers, 1990, ESPN
",sep=",",header=TRUE,stringsAsFactors=FALSE,strip.white=TRUE)  # strip.white drops the blank after each comma

index<-grep("New York.*",dat1$NAME)
dat1[index,]
#  ID             NAME YEAR      SOURCE
#1  1    New York Mets 1900        ESPN
#2  2 New York Yankees 1920 Cooperstown

A.K.

----- Original Message -----
From: Fred G <[hidden email]>
To: [hidden email]
Sent: Friday, August 10, 2012 1:41 PM
Subject: [R] Regular Expressions + Matrices |
|
Thanks Arun. The only issue is that I need the code to be very generalizable. The grep() really has to test this: if the first string up to the whitespace in a row (i.e. "New", "Boston", "Washington", "Detroit" below) is the same as the first string up to the whitespace in the row directly below it, AND the IDs are different, then copy. The actual file has thousands of different IDs and names...

On Fri, Aug 10, 2012 at 2:01 PM, arun <[hidden email]> wrote: |
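For reference, the rule described above (same first word as the row directly below, different ID) can be sketched without hard-coding any team name. This is only an illustration under assumptions: the column names ID and NAME follow the example data, and first_word, pair_hit, keep and flagged are names made up for the sketch.

```r
# Example data from the thread (ID and NAME columns only).
d <- data.frame(
  ID   = 1:5,
  NAME = c("New York Mets", "New York Yankees", "Boston Redsox",
           "Washington Nationals", "Detroit Tigers"),
  stringsAsFactors = FALSE
)

# First word: drop everything from the first whitespace character onward.
first_word <- sub("[[:space:]].*$", "", d$NAME)

n <- nrow(d)
# pair_hit[i] is TRUE when row i and row i + 1 share a first word
# but carry different IDs.
pair_hit <- first_word[-n] == first_word[-1] & d$ID[-n] != d$ID[-1]

# Keep a row if it is the upper or the lower member of a flagged pair.
keep <- c(pair_hit, FALSE) | c(FALSE, pair_hit)
flagged <- d[keep, ]
print(flagged)
```

On the example data this keeps the two New York rows and nothing else. The comparison uses == on the extracted first words, so a first word such as "New" is not treated as matching "Newark".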
|
In reply to this post by Fred G
Hello,
Try the following.

d <- read.table(textConnection("
ID NAME YEAR SOURCE
1 'New York Mets' 1900 ESPN
2 'New York Yankees' 1920 Cooperstown
3 'Boston Redsox' 1918 ESPN
4 'Washington Nationals' 2010 ESPN
5 'Detroit Tigers' 1990 ESPN
"), header=TRUE)

d$NAME <- as.character(d$NAME)

# TRUE if rows i and i + 1 have different IDs and the first word of
# row i occurs anywhere in the NAME of row i + 1
fun <- function(i, x){
    if(x[i, "ID"] != x[i + 1, "ID"]){
        s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
        if(grepl(s, x[i + 1, "NAME"])) return(TRUE)
    }
    FALSE
}

inx <- sapply(seq_len(nrow(d) - 1), fun, d)
inx <- c(inx, FALSE) | c(FALSE, inx)   # flag both rows of each matching pair
d[inx, ]

Hope this helps,

Rui Barradas

On 10-08-2012 18:41, Fred G wrote: |
|
In reply to this post by Fred G
Hello,
My code didn't anticipate a point you've made clear in this post. Inline.

On 10-08-2012 19:05, Fred G wrote:
> Thanks Arun. The only issue is that I need the code to be very
> generalizable, such that the grep() really has to be if the first string up
> to the whitespace in a row (ie "New", "Boston", "Washington", "Detroit"
> below) is the same as the first string up to the whitespace in the row
> directly below it

Does this mean that "New York" ---> "New" in one row shouldn't match "Other New" in the next row because "New" is not the first string up to the whitespace? If this is the case, modify my earlier code to

fun <- function(i, x){
    if(x[i, "ID"] != x[i + 1, "ID"]){
        s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]      # keep first string
        s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]  # keep first string
        if(s1 == s2) return(TRUE)   # first strings must be identical
    }
    FALSE
}

If it isn't the case, do nothing.

Rui Barradas

> , AND the ID's are different, then copy. The actual file
> has thousands of different IDs and names... |
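A small sketch of why it matters whether two first words are compared with == or with a grepl() pattern match. The candidate strings ("Newark", "St.", "Stockton") are made up for illustration and do not come from the poster's data.

```r
s1 <- "New"
next_first <- c("New", "Newark", "Other")   # hypothetical first words of the next row

# Unanchored regular-expression match: TRUE wherever "New" occurs in the string.
regex_hit <- grepl(s1, next_first)
# Exact comparison of the whole strings.
exact_hit <- s1 == next_first

print(regex_hit)   # "Newark" counts as a match here
print(exact_hit)   # only "New" itself matches

# A first word containing a regex metacharacter is interpreted as a pattern
# unless fixed = TRUE is given.
dot_as_regex   <- grepl("St.", "Stockton")                # "." matches any character
dot_as_literal <- grepl("St.", "Stockton", fixed = TRUE)  # literal "St." is absent
```

With thousands of names, exact comparison avoids both kinds of surprise; fixed = TRUE only removes the second one.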
|
Thanks so much, and thanks for the clarification. "New York" ---> "New"
should not match "Other New" because "New" is not the first string. Thanks so much, testing it on my data now.

On Fri, Aug 10, 2012 at 2:35 PM, Rui Barradas <[hidden email]> wrote: |
|
In reply to this post by Rui Barradas
If you think about this as a runs problem you can get a loopless solution that I think is easier to read (once the requisite functions are defined).

First define the function to canonicalize the name
   nickname <- function(x) sub(" .*", "", x)
then define some handy runs functions
   isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
   isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical
then use those functions on your dataset
   > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$ID)
   > d[ nearDup | isJustBefore(nearDup), ]
     ID             NAME YEAR      SOURCE
   1  1    New York Mets 1900        ESPN
   2  2 New York Yankees 1920 Cooperstown
See how it works with triplicates as well
   > dd <- rbind(d, data.frame(ID=6:8,
        NAME=c("Chicago Blacksox", "Chicago Cubs", "Chicago Whitesox"),
        YEAR=1701:1703, SOURCE=rep("made up", 3)))
   > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$ID)
   > dd[ nearDup | isJustBefore(nearDup), ]
     ID             NAME YEAR      SOURCE
   1  1    New York Mets 1900        ESPN
   2  2 New York Yankees 1920 Cooperstown
   6  6 Chicago Blacksox 1701     made up
   7  7     Chicago Cubs 1702     made up
   8  8 Chicago Whitesox 1703     made up

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Rui Barradas
> Sent: Friday, August 10, 2012 11:18 AM
> To: Fred G
> Cc: r-help
> Subject: Re: [R] Regular Expressions + Matrices |
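The original question also asks for the matching rows to be copied to a new file. A minimal sketch of that last step, assuming the selected rows are already in a data frame (called flagged here, a made-up name) and using a temporary file as a stand-in for the real output path:

```r
# Rows selected by any of the approaches in this thread (typed in by hand here).
flagged <- data.frame(
  ID     = 1:2,
  NAME   = c("New York Mets", "New York Yankees"),
  YEAR   = c(1900L, 1920L),
  SOURCE = c("ESPN", "Cooperstown"),
  stringsAsFactors = FALSE
)

# Placeholder path; replace with the real output file name.
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE keeps R's row labels out of the file.
write.csv(flagged, out_file, row.names = FALSE)

# Read the file back to confirm that the rows were copied in full.
copied <- read.csv(out_file, as.is = TRUE)
print(copied)
```

write.csv() creates the output file itself, so there is no need to read an existing output.csv beforehand.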
|
Thanks Bill! Works great! Thanks again guys!
On Fri, Aug 10, 2012 at 2:43 PM, William Dunlap <[hidden email]> wrote: |
