Split strings based on multiple patterns

Split strings based on multiple patterns

jCeradini
Afternoon,

I unfortunately inherited a dataframe with a column that has many fields
smashed together. My goal is to split the strings in the column into
separate columns based on patterns.

Example of what I'm working with:

ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
pH:Unkwn:
Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
substrate: Silt/Mud: Evidence of cattle grazing: none:
Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
amphibians observed")
ugly

Far as I can tell, there is not a single pattern that would work for
splitting this string. Splitting on ":" is close but not quite consistent.
Each of these attributes should be a separate column:

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

So, conceptually, I want to do something like this, where the string is
split for each of the patterns in attributes. However, strsplit only uses
the 1st value of attributes:
strsplit(ugly, attributes)
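
A small sketch of why that happens, using a made-up two-field string rather than the real data: the `split` argument of strsplit is recycled along the input vector, so with a single input string only its first element is ever used. Collapsing the patterns with "|" turns them into one regular expression that matches any of them. The names `x`, `patterns`, `first_only` and `combined` are only for illustration.

```r
x <- "Water temp:14 Water pH:7"
patterns <- c("Water temp:", "Water pH:")

# `split` is recycled along `x`, so with one string only patterns[1] is used
first_only <- strsplit(x, patterns)[[1]]

# One alternation pattern splits on every attribute name at once
# (names containing regex metacharacters would need escaping first)
combined <- strsplit(x, paste(patterns, collapse = "|"))[[1]]

print(first_only)
print(combined)
```

The leading empty string in `combined` appears because the first match sits at the very start of the input.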

Should I loop through the values of "attributes"?
Is there an argument in strsplit I'm missing that will do what I want?
Different approach altogether?

Thanks! Happy Friday.
Joe

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Split strings based on multiple patterns

David Winsemius

> On Oct 14, 2016, at 4:16 PM, Joe Ceradini <[hidden email]> wrote:
> [[alternative HTML version deleted]]

You need to post in plain text. We cannot see where your "carriage returns" are located in that data. HTML uses some other character(s?) that don't get translated by our mailserver.
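
One way to make the line breaks visible when sharing data, sketched here on a made-up string rather than the original: dput() prints an R object as code, so an embedded newline shows up as an explicit \n escape that survives plain-text email. The variable names are only for illustration.

```r
x <- "Water temp:14: F\nWater pH:Unkwn:"

# dput() writes the string as R code, with the newline shown as \n
shown <- capture.output(dput(x))
print(shown)

# A direct check for embedded newlines
has_newline <- grepl("\n", x, fixed = TRUE)
print(has_newline)
```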


> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

Yes, please do read that.

> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

Re: Split strings based on multiple patterns

Gabor Grothendieck
In reply to this post by jCeradini
Replace newlines and colons with spaces, since they seem to be junk.
Then generate a pattern that matches any of the attribute names,
replace each match with a comma, and finally read what is left into a
data frame using the attributes as column names.

(I have indented each line of code below by 2 spaces so if any line
starts before that then it's been wrapped around by the email and
needs to be adjusted.)

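  attributes <-
    c("Water temp", "Waterbody type", "Water pH", "Conductivity",
      "Water color", "Water turbidity", "Manmade", "Permanence",
      "Max water depth", "Primary substrate", "Evidence of cattle grazing",
      "Shoreline Emergent Veg(%)", "Fish present", "Fish species")

  # Newlines and colons are junk here, so turn them into spaces
  ugly2 <- gsub("[:\n]", " ", ugly)

  # Build one alternation pattern; punctuation in the names becomes "."
  # so that characters such as "(" are not treated as regex syntax
  pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
  ugly3 <- gsub(pat, ",", ugly2)

  # Read the comma-separated values and drop the empty first field
  dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
                   col.names = c("", attributes))[-1]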
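
To see how the same steps behave on a column with more than one row, here is a small self-contained sketch. The two input strings and the shortened attribute list are made up for illustration and are not from the original data; stringsAsFactors = FALSE is added only so the result is the same across R versions.

```r
# Hypothetical two-row input in the same "name:value:" layout
ugly <- c("Water temp:14: F Water pH:Unkwn: Fish present: yes:",
          "Water temp:9: F Water pH:7.2: Fish present: no:")
attributes <- c("Water temp", "Water pH", "Fish present")

# Same three steps as above: clean, replace names with commas, read
ugly2 <- gsub("[:\n]", " ", ugly)
pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
ugly3 <- gsub(pat, ",", ugly2)

# Each element of ugly3 becomes one row of the data frame
dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
                 col.names = c("", attributes),
                 stringsAsFactors = FALSE)[-1]
print(dd)
```

Note that read.table makes the column names syntactically valid by default, so "Water temp" comes back as Water.temp.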


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Re: Split strings based on multiple patterns

glsnow
In reply to this post by jCeradini
I would suggest looking at the strapply function in the gsubfn
package.  That gives you more flexibility in specifying what to look
for in the structure of the data, so that you can extract only those
pieces that you want.
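
The extract-rather-than-split idea can be sketched in base R with regexec and regmatches. This is not strapply itself, only a rough base-R analogue on a made-up string; the helper `get_value` and the assumed "name:value:" layout are hypothetical.

```r
x <- "Water temp:14: F Water pH:Unkwn: Fish present: yes:"
attributes <- c("Water temp", "Water pH", "Fish present")

# For one attribute, capture the text between "name:" and the next ":"
# (assumes the name has no regex metacharacters; escape them otherwise)
get_value <- function(name, string) {
  pattern <- paste0(name, ":\\s*([^:]*):")
  m <- regmatches(string, regexec(pattern, string))[[1]]
  # m holds the full match and the captured group, or nothing at all
  if (length(m) == 2) trimws(m[2]) else NA_character_
}

values <- vapply(attributes, get_value, character(1), string = x)
print(values)
```

Anything that follows the closing colon, such as the "F" after the temperature, is deliberately left out by this pattern.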


--
Gregory (Greg) L. Snow Ph.D.
[hidden email]

Re: Split strings based on multiple patterns

jCeradini
Thanks for the additional approach, Greg. I had success with Gabor's
recommendation but will take a look at gsubfn as well.

Joe

--
Cooperative Fish and Wildlife Research Unit
Zoology and Physiology Dept.
University of Wyoming
[hidden email] / 914.707.8506
wyocoopunit.org
