strapply and characters adjacent to the matched pattern


mdvaan
Hi,

In the example below, one of the searched patterns "SE" is matched in the word "second". I would like to ignore all matches in which the character following the match is one of [:alpha:]. How do I do this without removing the "ignore.case = T" argument of the strapply function? Thank you very much!

# load library
require(gsubfn)
# read in data
data <- c("Santa Fe Gold Corp|Starpharma Holdings|SE")
# define the object to be searched
text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
# match
strapply(text, data, ignore.case = T)

The preferred outcome would be:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

instead of:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "se"                  "Starpharma Holdings"



Re: strapply and characters adjacent to the matched pattern

arun kirshna
Hi,

I tried matching data against text using strapply, without success. However, you can get the desired result from data alone, if that helps you.

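# split data into single characters, dropping the "|" separators
dat2 <- strapply(data, "[^\\|]", c)
# glue the first two names back together (character positions are hard-coded)
list1 <- list(paste(dat2[[1]][1:18], collapse = ""), paste(dat2[[1]][19:37], collapse = ""))
list1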
[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"
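
For what it's worth, the same pieces can probably be recovered without hard-coded character positions by splitting on the separator. A minimal sketch in base R, assuming the names in data are always separated by "|" (parts is just an illustrative name). Note that it returns all three entries, including "SE", so on its own it does not address the partial match in "second":

```r
# same string as in the original post
data <- c("Santa Fe Gold Corp|Starpharma Holdings|SE")

# split on the literal "|" separator; fixed = TRUE avoids regex escaping
parts <- strsplit(data, "|", fixed = TRUE)[[1]]
parts
# [1] "Santa Fe Gold Corp"  "Starpharma Holdings" "SE"
```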

A.K.



--
View this message in context: http://r.789695.n4.nabble.com/strapply-and-characters-adjacent-to-the-matched-pattern-tp4637673.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: strapply and characters adjacent to the matched pattern

Gabor Grothendieck
In reply to this post by mdvaan
On Tue, Jul 24, 2012 at 5:06 PM, mdvaan <[hidden email]> wrote:

> In the example below, one of the searched patterns "SE" is matched in the
> word "second". I would like to ignore all matches in which the character
> following the match is one of [:alpha:]. How do I do this without removing
> the "ignore.case = T" argument of the strapply function? Thank you very
> much!

Try this:

> strapply(c("abc", "ab", "ab def"), "(ab|d)($|[^[:alpha:]])")
[[1]]
NULL

[[2]]
[1] "ab"

[[3]]
[1] "ab"
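
Applied to the data from the original question, this might look as follows. It is only a sketch, assuming the alternatives in data contain no regex metacharacters of their own; pat and res are names made up for illustration:

```r
library(gsubfn)

data <- c("Santa Fe Gold Corp|Starpharma Holdings|SE")
text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")

# wrap the alternatives in a group and require either the end of the string
# or a non-alphabetic character right after the match
pat <- paste0("(", data, ")($|[^[:alpha:]])")

# with no function given, strapply returns the first back reference,
# i.e. the matched name without the character that follows it
res <- strapply(text, pat, ignore.case = TRUE)
res
```

Here the "se" in "second" is followed by "c" and so should be skipped, while ignore.case stays in place.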


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: strapply and characters adjacent to the matched pattern

mdvaan
Thanks Gabor. That worked really well. I have been reading about the use of POSIX and regular expressions, and I tried to adapt your example to ignore all matches in which the character preceding (rather than following) the match is one of [:alpha:]. So far, I have been unsuccessful. Could anyone help me out here or direct me to a source that explains the combined use of POSIX and regular expressions? Thanks!

require(gsubfn)
# this only checks the character following the match and therefore also matches the third element
# however, I want it to match only the 2nd, 5th and 6th elements
strapply(c("abc", "ab", "abdef", "defc", "def", " def "), "(def|ab)($|[^[:alpha:]])")

The outcome should look like this:
[[1]]
NULL

[[2]]
[1] "ab"

[[3]]
NULL

[[4]]
NULL

[[5]]
[1] "def"

[[6]]
[1] "def"



Re: strapply and characters adjacent to the matched pattern

Gabor Grothendieck
On Wed, Jul 25, 2012 at 4:34 PM, mdvaan <[hidden email]> wrote:
> Thanks Gabor. That worked really well. I have been reading about the use of
> POSIX and regular expressions and I tried to use your example to see if I
> could  ignore all matches in which the character preceding (rather than
> following) the match is one of [:alpha:]? So far, I have been unsuccessful.
> Could anyone help me out here or direct me to a source that explains the
> combined use of POSIX and regular expressions? Thanks!
>

We match the start of the string or a non-alpha character, followed by the
desired string (here it is xx).  Because we want the second back
reference (the default is to return the first back reference if the
function is omitted), we must say so explicitly by using a
function which returns its second argument, e.g. function(x, y) y or
function(...) ..2 or, using the equivalent formula notation, just ~ ..2:

strapply(c("cxx", "xxc", "xx", " xx"), "(^|[^[:alpha:]])(xx)", ~ ..2)
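
To spell this out, the three forms mentioned above should be interchangeable, and the same idea can be combined with the earlier trailing check to test both sides of the match. This is a sketch only: x, y, r1, r2, r3 and both are illustrative names, and the combined pattern is simply the two patterns from this thread put together:

```r
library(gsubfn)

x <- c("cxx", "xxc", "xx", " xx")
pat <- "(^|[^[:alpha:]])(xx)"

# three ways of returning the second back reference
r1 <- strapply(x, pat, function(x, y) y)
r2 <- strapply(x, pat, function(...) ..2)
r3 <- strapply(x, pat, ~ ..2)

# check both the preceding and the following character;
# the name we want is still the second back reference
y <- c("abc", "ab", "abdef", "defc", "def", " def ")
both <- strapply(y, "(^|[^[:alpha:]])(def|ab)($|[^[:alpha:]])", ~ ..2)
both
```

With the test vector from the follow-up question, only the 2nd, 5th and 6th elements should produce a match. One thing to keep in mind is that these patterns consume the neighbouring character, so two hits separated by a single character (e.g. "ab ab") may not both be found.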



Re: strapply and characters adjacent to the matched pattern

mdvaan
Thanks Gabor for your invaluable help! I learned a lot.