Pattern match


Pattern match

Neeti
Hi ALL,

I have a very simple question about pattern matching. Could anyone tell me how I can use R to retrieve a string pattern from a text file? For example, my file contains the following information:

SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+

and I want to extract the “SpeciesScientific=(?)” information from this file. The problem is in the 3rd line, where the word SpeciesScientific is split by a +.

Could anyone help me please?
Thank you

Re: Pattern match

djmuseR
Hi:

This is a bit of a roundabout approach; I'm sure that folks with regex
expertise will trump this in a heartbeat. I modified the last piece of
the string a bit to accommodate the approach below. Depending on where
the strings have line breaks, you may have some odd '\n' characters
inserted.

# Step 1: read the input as a single character string
u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"

# Step 2: Split the input by the ';' delimiter and then use lapply()
# to split variable names from values.
# This results in a nested list for ulist2.
ulist <- strsplit(u, ';')
ulist2 <- lapply(ulist, function(s) strsplit(s, '='))

# Step 3: Break out the results into a matrix whose first column is
# the variable name and whose second column is the value (with parens
# included). This avoids dealing with nested lists.
v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)

# Step 4: Strip off the parens (this removes every parenthesis, which is
# why 'Cu(II)' shows up as 'CuII' below)
w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
colnames(w) <- c('Name', 'Value')
w
      Name                 Value
 [1,] "SpeciesCommon"      "Human"
 [2,] "SpeciesScientific"  "Homo sapiens"
 [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
 [4,] "BondInvolved"       "C-H"
 [5,] "EzCatDBID"          "S00343"
 [6,] "BondFormed"         "O-H,O-H"
 [7,] "Bond"               "255B"
 [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
 [9,] "CatalyticSwissProt" "P25006"
[10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
[11,] "SpeciesCommon"      "Bacteria"
[12,] "Reactive"           "Ce+"

# Step 5: Subset out the values of the SpeciesScientific variables
subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
                         Value
2                 Homo sapiens
10 Achromobacter\ncycloclastes
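
The '\n' in row 10 comes from the line break inside the input string. A minimal sketch, assuming such a break should simply become a single space, is to collapse whitespace before splitting. The names u_demo, u_clean and fields are made up for this illustration and are not part of the code above.

```r
# Hypothetical two-field input with a line break inside one value
u_demo <- "SpeciesScientific=(Achromobacter\ncycloclastes);SpeciesCommon=(Bacteria)"

# Replace any run of whitespace (including "\n" and "\r\n") with one space
u_clean <- gsub("[[:space:]]+", " ", u_demo)

# Split into "Name=(Value)" fields as in Step 2
fields <- strsplit(u_clean, ";", fixed = TRUE)[[1]]
fields
```

This only suits line breaks that fall between words; a break in the middle of a word would need to be removed rather than replaced by a space.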


One possible 'advantage' of this approach is that if you have a number
of string records of this type, you can create nested lists for each
string and then manipulate the lists to get what you need. Hopefully
you can use some of these ideas for other purposes as well.
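
As a rough sketch of the several-records idea, each record could be turned into a named character vector and the field of interest pulled from every record. The sample records and the helper parse_record() are hypothetical. Unlike Step 2, this sketch splits each field only at the first '=', which is an assumption meant to keep values that themselves contain '=' intact.

```r
# Hypothetical records in the same "Name=(Value);..." layout
records <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens)",
  "SpeciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria)"
)

# Turn one record into a named character vector: names are the keys,
# elements are the values without the outer parentheses
parse_record <- function(rec) {
  fields <- strsplit(rec, ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", fields)                  # text before the first "="
  vals   <- sub("^[^=]*=\\((.*)\\)$", "\\1", fields) # text inside the outer parens
  setNames(vals, keys)
}

parsed  <- lapply(records, parse_record)
species <- vapply(parsed, function(p) p[["SpeciesScientific"]], character(1))
species
```

Keeping one named vector per record means any other field (for example SpeciesCommon) can be looked up the same way.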

Dennis



On Wed, Apr 20, 2011 at 10:17 AM, Neeti <[hidden email]> wrote:

> --
> View this message in context: http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
> Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Pattern match

Neeti
Thank you Dennis,

Yes, the problem is the input file. I have an .rdf file in the same format
as I posted earlier. If I open the file in Notepad++, the lines are broken
with CR+LF characters. Is there any way to retrieve the SpeciesScientific
information without changing the input file?

Thank you

On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <[hidden email]> wrote:


Re: Pattern match

David Winsemius

On Apr 21, 2011, at 5:27 AM, neetika nath wrote:

> Thank you Dennis,
>
> yes the problem is the input file. i have .rdf file and the format is in
> same way i have posted earlier. if i open that file in notepad++ the lines
> are divided or broken  with CR+LF character. so any suggestion to retrieve
> SpeciesScientific information without changing the input file?

You might consider attaching the original file named with an extension  
of `.txt`, since your verbal description does not match your included  
example. What I see after the various servers have passed this around  
and inserted line-ends is the string `SpeciesScientific` in the first  
line, rather than in the third.

--
David


David Winsemius, MD
West Hartford, CT


Re: Pattern match

Neeti
Thank you for your message. Please see the attached file for a template/test
dataset of my file.


On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <[hidden email]>wrote:


temp_test.txt (2K) Download Attachment

Re: Pattern match

David Winsemius

On Apr 22, 2011, at 6:42 AM, neetika nath wrote:

>
> Thank you for your message. please see attach file for the template/test
> dataset of my file.
>
>
  lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
  lines <- readLines(lcon)
  close(lcon)  # release the connection once the lines are read
  lines
#-----don't paste---
  [1] "--"
  [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
  [3] "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
  [4] "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
  [5] "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
  [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+"
  [7] "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
  [8] ""
  [9] "--"
 [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
 [11] "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
# end don't paste-------------


# So the first goal is to collapse the broken lines, but only within the
# boundaries of "--"
# Find the line numbers with "--"
  startidx <- grep("\\-\\-", lines)
  startidx
#[1]  1  9 17
  endidx <- c(startidx[-1]-1, length(lines))
  endidx
#[1]  8 16 25
# Now collapse within those ranges (note: gsub() removes every "+", not only
# the continuation marks at the ends of lines)
  unplus <- sapply(seq_along(startidx), function(x) {
                     gsub("\\+", "", paste(lines[startidx[x]:endidx[x]], collapse = ""))
                   })
# break on what appears to be the correct delimiter, ";"
  lapply(unplus, function(longline)
                 grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
#[[1]]
#[1] 7

#[[2]]
#[1] 5

#[[3]]
#[1] 6
# Seems to succeed (admittedly after some errors that were elided), so save it

  lidx <- lapply(unplus, function(longline)
                 grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]))
# Create a properly split list to work with
  breaklist <- strsplit(unplus, ";")
# And extract the desired elements
  sapply(seq_along(startidx), function(idx) breaklist[[idx]][lidx[[idx]]])
#[1] "SpeciesScientific=(Homo sapiens)"               "SpeciesScientific=(Achromobacter cycloclastes)"
#[3] "SpeciesScientific=(Triticum aestivum)"
# Pulling the species from this simple list is left as a reader's exercise

--
David
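
One possible way to finish the exercise, offered only as a sketch and not as the method used in the post above: group the lines by their "--" markers, drop the continuation "+" only where it ends a line, and pull the species names with base R's gregexpr() and regmatches(). The sample demo_lines and the helper extract_species() are hypothetical. The sketch assumes the record marker is a line that is exactly "--" and that a species name never contains ")". As far as the readLines() documentation goes, LF, CRLF and CR are all accepted as line endings, so the CR+LF breaks should not need any change to the input file.

```r
# Hypothetical stand-in for readLines() output: two "--"-delimited records
# whose long lines were wrapped with a trailing "+"
demo_lines <- c(
  "--",
  "$DATUM SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);Bond+",
  "Formed=(O-H,O-H)",
  "--",
  "$DATUM CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria)"
)

extract_species <- function(lines) {
  # Each "--" line starts a new record; cumsum() turns that into a record id
  record_id <- cumsum(lines == "--")
  # Remove the "+" only at the end of a line, then glue each record together
  pieces <- split(sub("\\+$", "", lines), record_id)
  joined <- unname(vapply(pieces, paste, character(1), collapse = ""))
  # Find every SpeciesScientific=(...) field in every record
  hits <- regmatches(joined, gregexpr("SpeciesScientific=\\([^)]*\\)", joined))
  # Strip the key and the surrounding parentheses, keeping only the name
  sub("^SpeciesScientific=\\((.*)\\)$", "\\1", unlist(hits))
}

extract_species(demo_lines)
```

Removing "+" only at line ends leaves any "+" inside a value untouched, which differs from the gsub("\\+", "", ...) step above.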

David Winsemius, MD
West Hartford, CT



Re: Pattern match

Neeti
Thank you so much.

On Fri, Apr 22, 2011 at 1:29 PM, David Winsemius <[hidden email]>wrote:

>
> On Apr 22, 2011, at 6:42 AM, neetika nath wrote:
>
>
> Thank you for your message. please see attach file for the template/test
> dataset of my file.
>
>
> On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <[hidden email]>wrote:
>
>>
>> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>>
>>  Thank you Dennis,
>>>
>>> yes the problem is the input file. i have .rdf file and the format is in
>>> same way i have posted earlier. if i open that file in notepad++ the
>>> lines
>>> are divided or broken  with CR+LF character. so any suggestion to
>>> retrieve
>>> SpeciesScientific information without changing the input file?
>>>
>>
>> You might consider attaching the original file named with an extension of
>> `.txt`, since your verbal description does not match your included example.
>> What I see after the various servers have passed this around and inserted
>> line-ends is the string `SpeciesScientific` in the first line, rather than
>> in the third.
>>
>>  lcon <- file("/Users/davidwinsemius/Downloads/temp_test.txt")
>  lines <- readLines(lcon)
>  lines
> #-----don't paste---
>  [1] "--"
>
>  [2] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
>
>  [3]
> "lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+"
>  [4]
> "C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+"
>  [5]
> "y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+"
>  [6] "SpeciesCommon=(Human);SpeciesScientific=(Homo
> sapiens);ReactiveCentres=(N,C,C,C,+"
>  [7]
> "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+"
>  [8] ""
>
>  [9] "--"
>
> [10] "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION"
>
> [11] "$DATUM
> CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+"
> # end don't paste-------------
>
>
> # So the first goal is to collapse the broken lines but only within
> boundaries of "--"
> # Find the line numbers with "--"
>  startidx <- grep("\\-\\-", lines)
>  startidx
> #[1]  1  9 17
>  endidx <- c(startidx[-1]-1, length(lines))
>  endidx
> #[1]  8 16 25
> # Now collapse within those ranges
>  unplus <- sapply(1:length(startidx), function(x){
>                     gsub("\\+", "", paste(lines[startidx[x]:endidx[x]],
> collapse="") )
>                                                  } )
> # break on what appears to be the correct delimiter, ";"
>  lapply(unplus, function(longline)
>                 grep("SpeciesScientific=\\(", strsplit(longline, ";")[[1]]
> ) )
> #[[1]]
> #[1] 7
>
> #[[2]]
> #[1] 5
>
> #[[3]]
> #[1] 6
> #Seems to succeed (admittedly after some errors that were elided. So save
> it
>
>  lidx <- lapply(unplus, function(longline) grep("SpeciesScientific=\\(",
> strsplit(longline, ";")[[1]] ) )
> #Create a properly split list to work with
>  breaklist <- strsplit(unplus, ";")
> # And extract the desired elements
>  sapply(1:length(startidx), function(idx) breaklist[[idx]][ lidx[[idx]] ] )
> #[1] "SpeciesScientific=(Homo sapiens)"
> "SpeciesScientific=(Achromobacter cycloclastes)"
> #[3] "SpeciesScientific=(Triticum aestivum)"
> # Pulling the species from this simple list is left as a reader's exercise
>
>  David Winsemius, MD
> West Hartford, CT
>
>