Matching multiple search criteria (Unlisting a nested dataset, take 2)

Matching multiple search criteria (Unlisting a nested dataset, take 2)

Nathan Parsons
Thanks all for your patience. Here’s a second go that is perhaps more
explicative of what it is I am trying to accomplish (and hopefully in plain
text form)...


I’m using the following packages: tidyverse, purrr, tidytext


I have a number of tweets in the following form:


th <- structure(list(status_id = c("x1047841705729306624",
"x1046966595610927105",

"x1047094786610552832", "x1046988542818308097", "x1046934493553221632",

"x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",

"2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",

"2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is
everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",

"@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me
tell you man. You are the fucking messiah",

"@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not
being hung over tomorrow vs. not fucking up your life ten years later.",

"I tend to think about my dreams before I sleep.", "@MichaelAvenatti
@SenatorCollins So, if your client was in her 20s, attending parties with
teenagers, doesn't that make her at the least immature as hell, or at the
worst, a pedophile and a person contributing to the delinquency of minors?",

"i wish i could take credit for this"), lat = c(43.6835853, 40.284123,

37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,

-83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426

), county_name = c("Cumberland County", "Delaware County", "San Francisco
County",

"Allegheny County", "Concho County", "Los Angeles County"), fips = c(23005L,

39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",

"Ohio", "California", "Pennsylvania", "Texas", "California"),

state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level = c("Medium
Metro",

"Large Fringe Metro", "Large Central Metro", "Large Central Metro",

"NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,

2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,

1160433L, 4160L, 9509611L)), class = c("data.table", "data.frame"

), row.names = c(NA, -6L), .internal.selfref = )


I also have a number of search terms in the following form:


st <- structure(list(terms = c("me abused depressed", "me hurt depressed",

"feel hopeless depressed", "feel alone depressed", "i feel helpless",

"i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",

"tbl", "data.frame”))


I am trying to isolate, for each search term, the tweets that contain all of
its words (e.g. “me”, “abused”, and “depressed” for the first search term).
The words do not have to be in order or even next to one another.


I am familiar with the dplyr suite of tools and have been attempting to
generate some sort of ‘filter()’ to do this. I am not very familiar with
purrr, but there may be a solution using the map function? I have also
explored the tidytext ‘unnest_tokens’ function which transforms the ’th’
data in the following way:


> tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> head(tt)
              status_id           created_at      lat       lng
1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
         county_name  fips state_name state_abb  urban_level urban_code
1: Cumberland County 23005      Maine        ME Medium Metro          3
2: Cumberland County 23005      Maine        ME Medium Metro          3
3: Cumberland County 23005      Maine        ME Medium Metro          3
4: Cumberland County 23005      Maine        ME Medium Metro          3
5: Cumberland County 23005      Maine        ME Medium Metro          3
6: Cumberland County 23005      Maine        ME Medium Metro          3
   population       word
1:     277308  technique
2:     277308         is
3:     277308 everything
4:     277308       with
5:     277308    olympic
6:     277308      lifts


but once I have unnested the tokens, I am unable to recombine them back
into tweets.
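
One way around the recombination problem is not to recombine at all: the
tokens can be tested per tweet while still in long form, and only the result
joined back by status_id. The following is a minimal base-R sketch (no
tidytext or dplyr), assuming a long-format table shaped like `tt` above. The
toy data and the names `tok`, `has_all()` and `tweets` are hypothetical and
only for illustration.

```r
# Toy long-format token table, one row per (tweet, word), shaped like `tt`.
tok <- data.frame(
  status_id = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("i", "really", "feel", "worthless", "i", "feel", "fine"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one search term, return a named logical vector
# (one element per tweet) that is TRUE when the tweet's tokens include
# every word of the term.
has_all <- function(term, tokens) {
  needed <- strsplit(term, " +")[[1]]
  by_tweet <- split(tokens$word, tokens$status_id)
  vapply(by_tweet, function(w) all(needed %in% w), logical(1))
}

hit <- has_all("i feel worthless", tok)

# Join the result back onto a one-row-per-tweet table by status_id.
tweets <- data.frame(status_id = c("a", "b"), stringsAsFactors = FALSE)
tweets$flag <- as.integer(hit[tweets$status_id])
tweets$flag  # 1 for tweet "a", 0 for tweet "b" ("worthless" is missing)
```

Because the match is on whole tokens, order and adjacency of the words do not
matter, which is the behaviour asked for above.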


Ideally the end result would append a new column to the ‘th’ data that
would flag a tweet that contained all of the search words for any of the
search terms; so the work flow would look like

1) look for all search words for one search term in a tweet

2) if all of the search words in the search term are found, create a flag
(mutate(flag = 1) or some such)

3) do this for all of the tweets

4) move on to the next search term and repeat
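
The four steps above can be written out almost literally as two nested loops
in base R, working on the raw text rather than on tokens. This is a hedged
sketch, not a tested solution for the full data: `tweets_demo` and
`terms_demo` are made-up stand-ins for `th` and `st$terms`, and the use of
`\\b` word boundaries is an assumption about what should count as a match.
The demo text is already lower case, so case handling is left out.

```r
# Small stand-ins for th and st$terms (hypothetical data).
tweets_demo <- data.frame(
  status_id = c("x1", "x2", "x3"),
  text = c("i feel so worthless today",
           "depressed... he hurt me again",
           "i wish i could take credit for this"),
  stringsAsFactors = FALSE
)
terms_demo <- c("me hurt depressed", "i feel worthless")

tweets_demo$flag <- 0L
for (term in terms_demo) {                       # step 4: each search term
  words <- strsplit(term, " +")[[1]]
  for (i in seq_len(nrow(tweets_demo))) {        # step 3: each tweet
    # step 1: is every word of the term present as a whole word?
    found <- vapply(words, function(w)
      grepl(paste0("\\b", w, "\\b"), tweets_demo$text[i]), logical(1))
    if (all(found)) tweets_demo$flag[i] <- 1L    # step 2: set the flag
  }
}
tweets_demo$flag  # 1 1 0
```

The explicit loops mirror the description; on a large number of tweets a
vectorised call to grepl() over the whole text column would likely be faster.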


Again, my thanks for your patience.


--


Nate Parsons

Pronouns: He, Him, His

Graduate Teaching Assistant

Department of Sociology

Portland State University

Portland, Oregon


503-725-9025

503-725-3957 FAX

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Nathan Parsons
Argh! Here are those two example datasets as data frames (not tibbles).
Sorry again. This apparently is just not my day.


th <- structure(list(
  status_id = c("x1047841705729306624", "x1046966595610927105",
    "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
    "x1047227442899775488"),
  created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
    "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z", "2018-10-02T01:26:49Z",
    "2018-10-02T20:50:53Z"),
  text = c(
    "Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
    "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
    "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
    "I tend to think about my dreams before I sleep.",
    "@MichaelAvenatti @SenatorCollins So,  if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
    "i wish i could take credit for this"),
  lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935,
    33.9376735),
  lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885,
    -118.130426),
  county_name = c("Cumberland County", "Delaware County",
    "San Francisco County", "Allegheny County", "Concho County",
    "Los Angeles County"),
  fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
  state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas",
    "California"),
  state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
  urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
    "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
  urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
  population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)),
  class = "data.frame", row.names = c(NA, -6L))


st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
  "feel hopeless depressed", "feel alone depressed", "i feel helpless",
  "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
  "tbl", "data.frame"))

On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <[hidden email]>
wrote:

> [...]


Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter-2
The problem wasn't the data tibbles. You posted in HTML -- which you were
explicitly warned against -- and that corrupted your text (e.g. some quotes
became "smart quotes", which cannot be properly cut and pasted into R).
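
To illustrate the point about smart quotes: R only accepts straight quotes as
string delimiters, so code pasted from an HTML mail can fail to parse. The
sketch below shows one way to convert typographic quotes back before the text
is parsed. The helper name `straighten_quotes()` is made up for this example,
and a UTF-8 locale is assumed.

```r
# Hypothetical helper: replace typographic quotes with ASCII equivalents.
straighten_quotes <- function(x) {
  x <- gsub("[\u201C\u201D]", "\"", x)  # curly double quotes
  gsub("[\u2018\u2019]", "'", x)        # curly single quotes
}

# A fragment like the end of the posted `st` code: the closing quote after
# data.frame is a smart quote, so the string is never terminated.
broken <- "c(\"tbl\", \"data.frame\u201D)"
fixed <- straighten_quotes(broken)

# The repaired text parses and evaluates to the intended character vector.
eval(parse(text = fixed))
```

Calling parse(text = broken) on the unrepaired fragment stops with a syntax
error, which is what anyone pasting the original post into R would run into.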

Bert


On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <[hidden email]>
wrote:

> [...]


Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter-2
OK, as no one else has offered a solution, I'll take a whack at it.

Caveats: This is a brute force attempt using R's basic regular expression
engine. It is inelegant and barely tested, so likely to be at best
incomplete and buggy, and at worst, incorrect. But maybe Nathan or someone
else on the list can fix it up. So if (when) it breaks, complain on the
list to give someone (almost certainly not me) the opportunity.

The basic idea is that the tweets are just character strings and the search
phrases are just character vectors, all of whose elements must match
"appropriately" -- i.e. they must match whole words -- in the character
strings. So my desired output from the code is a list indexed by the search
phrases, each of whose components is a logical vector with one element per
tweet; an element is TRUE iff all the words in the search phrase match
somewhere in that tweet.

Here's the code (using the data Nathan provided):

> words <- sapply(st[[1]],strsplit,split = " +" )
## convert the phrases to a list of character vectors of the words
## Result:
> words
$`me abused depressed`
[1] "me"        "abused"    "depressed"

$`me hurt depressed`
[1] "me"        "hurt"      "depressed"

$`feel hopeless depressed`
[1] "feel"      "hopeless"  "depressed"

$`feel alone depressed`
[1] "feel"      "alone"     "depressed"

$`i feel helpless`
[1] "i"        "feel"     "helpless"

$`i feel worthless`
[1] "i"         "feel"      "worthless"

> expand.words <- function(z) lapply(z, function(x)
      paste0(c("^ *", " ", " "), x, c(" ", " ", " *$")))
## function to create regexes for words when they are at the beginning,
## middle, or end of tweets

> wordregex <- lapply(words,expand.words)
##Result
## too lengthy to include
##
> tweets <- th$text
##extract the tweets
> findin <- function(x,y)
   ## x is a vector of regex patterns
   ## y is a character vector
   ## value = vector, vec, with length(vec) == length(y) and vec[i] == TRUE
   ## iff any of x matches y[i]
{ apply(sapply(x,function(z)grepl(z,y)), 1,any)
}

## add a matching "tweet" to the tweet vector:
> tweets <- c(tweets," i xxxx worthless yxxc ght feel")

> ans <- lapply(wordregex, function(z)
      apply(sapply(z, function(x) findin(x, tweets)), 1, all))
## Result:
> ans
$`me abused depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`me hurt depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel hopeless depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel alone depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel helpless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel worthless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

## None of the tweets match any of the phrases except for the last tweet
## that I added.

## Note: you need to add capabilities to handle upper and lower case. See,
## e.g. ?casefold
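
On the case-handling note, one simple option is to fold the tweets to lower
case before matching (tolower(), or casefold() as suggested). The sketch
below also shows how a list shaped like `ans` could be collapsed into the
single flag column asked for in the original post. `ans_demo` is a
hand-written stand-in rather than output of the code above, and the `\\b`
word-boundary pattern is used only to keep the example short.

```r
# Case folding: the pattern is lower case, so fold the text before matching.
grepl("\\bfeel\\b", "I FEEL fine")            # FALSE: case differs
grepl("\\bfeel\\b", tolower("I FEEL fine"))   # TRUE after folding

# A list shaped like `ans`: one logical vector per search phrase,
# one element per tweet (values invented for illustration).
ans_demo <- list(
  `me hurt depressed` = c(FALSE, TRUE,  FALSE),
  `i feel worthless`  = c(FALSE, FALSE, TRUE)
)

# Element-wise OR across phrases gives one flag per tweet.
flag <- as.integer(Reduce(`|`, ans_demo))
flag  # 0 1 1
```

With the real objects, the same Reduce() call on `ans` would give a vector
that could be assigned to a new column of `th`, provided the extra test tweet
is not appended to `tweets` so that the lengths agree.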

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <[hidden email]> wrote:

> [...]
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Nathan Parsons
I do not have your command of base R, Bert. That is a herculean effort! Here’s what I spent my night putting together:

## Create search terms
## dput(st)
st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
), word3 = c("lifts", "depressed", "depressed", "depressed",
"depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))

## Create tweets
## dput(th)
th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
"x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
"x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
"x1048269727582355457", "x1048092408544677890"), created_at = c("2018-10-04T13:31:45Z",
"2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
"2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
"2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
), text = c("technique is everything with olympic lifts ! @ body by john ",
"@subtronics just went back and rewatched ur fblice with ur cdjs and let me tell you man. you are the fucking messiah",
"@ic4rus1 opportunistic means short-game. as in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
"i tend to think about my dreams before i sleep.", "@michaelavenatti @senatorcollins so if your client was in her 20s attending parties with teenagers doesnt that make her at the least immature as hell or at the worst a pedophile and a person contributing to the delinquency of minors?",
"i wish i could take credit for this", "i woulda never imagined. #lakeshow ",
"@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
"sunday ill have a booth in katy at the real craft wives of katy fest @nolabelbrewco cmon yall!everything is better when you top it with tias!order today we ship to all 50 ",
"dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
-100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
-86.2447285), county_name = c("Cumberland County", "Delaware County",
"San Francisco County", "Allegheny County", "Concho County",
"Los Angeles County", "Los Angeles County", "Marathon County",
"Harris County", "Montgomery County"), fips = c(23005L, 39041L,
6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
state_name = c("Maine", "Ohio", "California", "Pennsylvania",
"Texas", "California", "California", "Wisconsin", "Texas",
"Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
"CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
"Large Fringe Metro", "Large Central Metro", "Large Central Metro",
"NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
"Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
10L), class = "data.frame")

library(tidyverse)  ## loads dplyr, stringr and purrr, which the code below uses

## Clean tweets - basically just remove everything we don’t need from the text including punctuation and urls
th %>%
  mutate(linenumber = row_number(),
         text = str_remove_all(text, "[^\x01-\x7F]"),
         text = str_remove_all(text, "\n"),
         text = str_remove_all(text, ","),
         text = str_remove_all(text, "'"),
         text = str_remove_all(text, "&"),
         text = str_remove_all(text, "<"),
         text = str_remove_all(text, ">"),
         text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
         text = tolower(text)) -> th

## Create search function that looks for each search word in the provided string, evaluates whether all three have been found, and returns a logical
srchr <- function(txt) {
  has_olympic   <- str_detect(txt, "olympic")
  has_technique <- str_detect(txt, "technique")
  has_lifts     <- str_detect(txt, "lifts")
  has_olympic & has_technique & has_lifts
}

## Evaluate tweets for presence of search term
## (map_lgl() keeps flag as a logical column rather than a character one)
th %>%
  mutate(flag = map_lgl(text, srchr)) -> th_flagged

As far as I can tell, this works. I have to manually enter each set of search terms into the function, which is not ideal. Also, this only generates a True/False for each tweet based on one search term - I end up with an evaluatory column for each search term that I would then have to collapse together somehow. I’m sure there’s a more elegant solution.
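
One way to avoid typing each set of words into srchr by hand is to pass the words in as an argument and loop over the rows of st. The sketch below is not from the thread: has_all_words, st_demo and tweets_demo are made-up names and toy data, and it uses base R's grepl(fixed = TRUE) as a stand-in for str_detect, so it matches substrings rather than whole words. The per-term results are collapsed into a single flag with a logical OR.

```r
## Hypothetical helper: TRUE for each element of `txt` that contains
## every word in `words` (substring match, order does not matter).
has_all_words <- function(txt, words) {
  Reduce(`&`, lapply(words, function(w) grepl(w, txt, fixed = TRUE)))
}

## Toy search terms in the word1/word2/word3 layout used above (assumed data).
st_demo <- data.frame(word1 = c("technique", "feel"),
                      word2 = c("olympic", "your"),
                      word3 = c("lifts", "pain"),
                      stringsAsFactors = FALSE)

## Toy tweets, already lower-cased (assumed data).
tweets_demo <- c("technique is everything with olympic lifts",
                 "its ok phil! i feel your pain!",
                 "dolly is so baddd")

## One logical vector per search term, each as long as the tweet vector.
per_term <- lapply(seq_len(nrow(st_demo)), function(i) {
  has_all_words(tweets_demo, unlist(st_demo[i, ], use.names = FALSE))
})

## A tweet is flagged if it satisfies at least one search term.
flag <- Reduce(`|`, per_term)
print(flag)
```

With the real data the same idea would be th$flag <- Reduce(`|`, per_term) after building per_term from th$text and st, which gives one column instead of one column per search term.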

--

Nate Parsons
Pronouns: He, Him, His
Graduate Teaching Assistant
Department of Sociology
Portland State University
Portland, Oregon

503-725-9025
503-725-3957 FAX
On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <[hidden email]>, wrote:

> OK, as no one else has offered a solution, I'll take a whack at it.
>
> Caveats: This is a brute force attempt using R's basic regular expression engine. It is inelegant and barely tested, so likely to be at best incomplete and buggy, and at worst, incorrect. But maybe Nathan or someone else on the list can fix it up. So if (when) it breaks, complain on the list to give someone (almost certainly not me) the opportunity.
>
> The basic idea is that the tweets are just character strings and the search phrases are just character vectors all of whose elements must match "appropriately" -- i.e. they must match whole words -- in the character strings. So my desired output from the code is a list indexed by the search phrases, each of whose components is a logical vector, with length equal to the number of tweets, each of whose elements is TRUE iff all the words in the search phrase match somewhere in the tweet.
>
> Here's the code(using the data Nathan provided):
>
> > words <- sapply(st[[1]],strsplit,split = " +" )
> ## convert the phrases to a list of character vectors of the words
> ## Result:
> > words
> $`me abused depressed`
> [1] "me"        "abused"    "depressed"
>
> $`me hurt depressed`
> [1] "me"        "hurt"      "depressed"
>
> $`feel hopeless depressed`
> [1] "feel"      "hopeless"  "depressed"
>
> $`feel alone depressed`
> [1] "feel"      "alone"     "depressed"
>
> $`i feel helpless`
> [1] "i"        "feel"     "helpless"
>
> $`i feel worthless`
> [1] "i"         "feel"      "worthless"
>
> > expand.words <-  function(z)lapply(z,function(x)paste0(c("^ *"," "," "),x, c(" "," "," *$")))
> ## function to create regexes for words when they are at the beginning, middle, or end of tweets
>
> > wordregex <- lapply(words,expand.words)
> ##Result
> ## too lengthy to include
> ##
> > tweets <- th$text
> ##extract the tweets
> > findin <- function(x,y)
>    ## x is a vector of regex patterns
>    ## y is a character vector
>    ## value = vector,vec, with length(vec) == length(y) and vec[i] == TRUE iff any of x matches y[i]
> { apply(sapply(x,function(z)grepl(z,y)), 1,any)
> }
>
> ## add a matching "tweet" to the tweet vector:
> > tweets <- c(tweets," i xxxx worthless yxxc ght feel")
>
> > ans <- lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1, all))
> ## Result:
> > ans
> $`me abused depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`me hurt depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel hopeless depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel alone depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel helpless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel worthless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>
> ## None of the tweets match any of the phrases except for the last tweet that I added.
>
> ## Note: you need to add capabilities to handle upper and lower case. See, e.g. ?casefold
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> > On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <[hidden email]> wrote:
> > > The problem wasn't the data tibbles. You posted in html -- which you were explicitly warned against -- and that corrupted your text (e.g. some quotes became "smart quotes", which cannot be properly cut and pasted into R).
> > >
> > > Bert
> > >
> > >
> > > > On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <[hidden email]> wrote:
> > > > > Argh! Here are those two example datasets as data frames (not tibbles).
> > > > > Sorry again. This apparently is just not my day.
> > > > >
> > > > >
> > > > > th <- structure(list(status_id = c("x1047841705729306624",
> > > > > "x1046966595610927105",
> > > > >
> > > > > "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
> > > > >
> > > > > "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
> > > > >
> > > > > "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
> > > > >
> > > > > "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is
> > > > > everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
> > > > >
> > > > > "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me
> > > > > tell you man. You are the fucking messiah",
> > > > >
> > > > > "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not
> > > > > being hung over tomorrow vs. not fucking up your life ten years later.",
> > > > >
> > > > > "I tend to think about my dreams before I sleep.", "@MichaelAvenatti
> > > > > @SenatorCollins So,  if your client was in her 20s, attending parties with
> > > > > teenagers, doesn't that make her at the least immature as hell, or at the
> > > > > worst, a pedophile and a person contributing to the delinquency of minors?",
> > > > >
> > > > >
> > > > > "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
> > > > >
> > > > > 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
> > > > >
> > > > > -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
> > > > >
> > > > > ), county_name = c("Cumberland County", "Delaware County", "San Francisco
> > > > > County",
> > > > >
> > > > > "Allegheny County", "Concho County", "Los Angeles County"), fips = c(23005L,
> > > > >
> > > > >
> > > > > 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
> > > > >
> > > > > "Ohio", "California", "Pennsylvania", "Texas", "California"),
> > > > >
> > > > >     state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level =
> > > > > c("Medium Metro",
> > > > >
> > > > >     "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
> > > > >
> > > > >     "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
> > > > >
> > > > >     2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
> > > > >
> > > > >     1160433L, 4160L, 9509611L)), class = "data.frame", row.names = c(NA,
> > > > >
> > > > > -6L))
> > > > >
> > > > >
> > > > > st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
> > > > >
> > > > > "feel hopeless depressed", "feel alone depressed", "i feel helpless",
> > > > >
> > > > > "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
> > > > >
> > > > > "tbl", "data.frame"))
> > > > >
> > > > > On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <[hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks all for your patience. Here’s a second go that is perhaps more
> > > > > > explicative of what it is I am trying to accomplish (and hopefully in plain
> > > > > > text form)...
> > > > > >
> > > > > >
> > > > > > I’m using the following packages: tidyverse, purrr, tidytext
> > > > > >
> > > > > >
> > > > > > I have a number of tweets in the following form:
> > > > > >
> > > > > >
> > > > > > th <- structure(list(status_id = c("x1047841705729306624",
> > > > > > "x1046966595610927105",
> > > > > >
> > > > > > "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
> > > > > >
> > > > > > "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
> > > > > >
> > > > > > "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
> > > > > >
> > > > > > "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is
> > > > > > everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
> > > > > >
> > > > > > "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let
> > > > > > me tell you man. You are the fucking messiah",
> > > > > >
> > > > > > "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not
> > > > > > being hung over tomorrow vs. not fucking up your life ten years later.",
> > > > > >
> > > > > > "I tend to think about my dreams before I sleep.", "@MichaelAvenatti
> > > > > > @SenatorCollins So, if your client was in her 20s, attending parties with
> > > > > > teenagers, doesn't that make her at the least immature as hell, or at the
> > > > > > worst, a pedophile and a person contributing to the delinquency of minors?",
> > > > > >
> > > > > > "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
> > > > > >
> > > > > > 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
> > > > > >
> > > > > > -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
> > > > > >
> > > > > > ), county_name = c("Cumberland County", "Delaware County", "San Francisco
> > > > > > County",
> > > > > >
> > > > > > "Allegheny County", "Concho County", "Los Angeles County"), fips =
> > > > > > c(23005L,
> > > > > >
> > > > > > 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
> > > > > >
> > > > > > "Ohio", "California", "Pennsylvania", "Texas", "California"),
> > > > > >
> > > > > > state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level = c("Medium
> > > > > > Metro",
> > > > > >
> > > > > > "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
> > > > > >
> > > > > > "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
> > > > > >
> > > > > > 2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
> > > > > >
> > > > > > 1160433L, 4160L, 9509611L)), class = c("data.table", "data.frame"
> > > > > >
> > > > > > ), row.names = c(NA, -6L), .internal.selfref = )
> > > > > >
> > > > > >
> > > > > > I also have a number of search terms in the following form:
> > > > > >
> > > > > >
> > > > > > st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
> > > > > >
> > > > > > "feel hopeless depressed", "feel alone depressed", "i feel helpless",
> > > > > >
> > > > > > "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
> > > > > >
> > > > > > "tbl", "data.frame”))
> > > > > >
> > > > > >
> > > > > > I am trying to isolate the tweets that contain all of the words in each of
> > > > > > the search terms, i.e “me” “abused” and “depressed” from the first example
> > > > > > search term, but they do not have to be in order or even next to one
> > > > > > another.
> > > > > >
> > > > > >
> > > > > > I am familiar with the dplyr suite of tools and have been attempting to
> > > > > > generate some sort of ‘filter()’ to do this. I am not very familiar with
> > > > > > purrr, but there may be a solution using the map function? I have also
> > > > > > explored the tidytext ‘unnest_tokens’ function which transforms the ’th’
> > > > > > data in the following way:
> > > > > >
> > > > > >
> > > > > > > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> > > > > >
> > > > > > > head(tt)
> > > > > >
> > > > > > status_id created_at lat lng
> > > > > >
> > > > > > 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> > > > > >
> > > > > > county_name fips state_name state_abb urban_level urban_code
> > > > > >
> > > > > > 1: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > 2: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > 3: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > 4: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > 5: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > 6: Cumberland County 23005 Maine ME Medium Metro 3
> > > > > >
> > > > > > population word
> > > > > >
> > > > > > 1: 277308 technique
> > > > > >
> > > > > > 2: 277308 is
> > > > > >
> > > > > > 3: 277308 everything
> > > > > >
> > > > > > 4: 277308 with
> > > > > >
> > > > > > 5: 277308 olympic
> > > > > >
> > > > > > 6: 277308 lifts
> > > > > >
> > > > > >
> > > > > > but once I have unnested the tokens, I am unable to recombine them back
> > > > > > into tweets.
> > > > > >
> > > > > >
> > > > > > Ideally the end result would append a new column to the ‘th’ data that
> > > > > > would flag a tweet that contained all of the search words for any of the
> > > > > > search terms; so the work flow would look like
> > > > > >
> > > > > > 1) look for all search words for one search term in a tweet
> > > > > >
> > > > > > 2) if all of the search words in the search term are found, create a flag
> > > > > > (mutate(flag = 1) or some such)
> > > > > >
> > > > > > 3) do this for all of the tweets
> > > > > >
> > > > > > 4) move on the next search term and repeat
> > > > > >
> > > > > >
> > > > > > Again, my thanks for your patience.
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > > Nate Parsons
> > > > > >
> > > > > > Pronouns: He, Him, His
> > > > > >
> > > > > > Graduate Teaching Assistant
> > > > > >
> > > > > > Department of Sociology
> > > > > >
> > > > > > Portland State University
> > > > > >
> > > > > > Portland, Oregon
> > > > > >
> > > > > >
> > > > > > 503-725-9025
> > > > > >
> > > > > > 503-725-3957 FAX
> > > > > >
> > > > >
> > > > >         [[alternative HTML version deleted]]
> > > > >
> > > > > ______________________________________________
> > > > > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible code.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter-2
If you wish to use R, you need to at least understand its basic data
structures and functionality. Expecting that mimicry of code in special
packages will suffice is, I believe, an illusion. If you haven't already
done so, you should go through a basic R tutorial or two (there are many on
the web; some recommendations, by no means necessarily "the best",  can be
found here:
https://www.rstudio.com/online-learning/#r-programming).

Having said that, I realized that my previous "solution" using regular
expressions was more complicated than it needed to be and somewhat foolish
(so much for all my "expertise"). A simpler and better approach is simply
to break up both the tweet texts and your search phrases into vectors of
their "words" (i.e. character strings surrounded by spaces) using
strsplit(), and then to use R's built-in matching capabilities with %in%.
This is quite straightforward, pretty robust (no regex's to wrestle with),
and does not require "herculean efforts" to understand. The only wrinkle is
some bookkeeping with the "apply" family of functions. These are, as you
may know, the functional programming way of handling iteration (loops), but
they are what I would consider part of "basic" R functionality and worth
spending the time to learn about.
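
To make the idea concrete before the full version, here is a minimal sketch for a single tweet and a single search phrase. The two strings are made up for illustration and are not part of the example data.

```r
## Made-up tweet and search phrase, for illustration only.
tweet  <- "I feel so worthless today"
phrase <- "i feel worthless"

## Lower-case, then split on runs of spaces; strsplit() returns a list,
## so [[1]] pulls out the character vector of words.
tweet_words  <- strsplit(tolower(tweet),  split = " +")[[1]]
phrase_words <- strsplit(tolower(phrase), split = " +")[[1]]

## %in% tests each phrase word for membership in the tweet's words;
## all() then asks whether every phrase word was found.
found <- phrase_words %in% tweet_words
print(found)
print(all(found))

## Whole-word matching is the point: "me" is not a word of "the messiah",
## although it is a substring of it.
whole_word <- "me" %in% strsplit("the messiah", split = " +")[[1]]
substring_hit <- grepl("me", "the messiah", fixed = TRUE)
print(c(whole_word = whole_word, substring_hit = substring_hit))
```

Word order and adjacency do not matter here, because %in% only checks membership.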

Herewith my better, simpler proposal, using your example data as before:

getwords <- function(x)strsplit(tolower(x),split = " +")
## split text into a vector of lower-cased "words"

phrasewords <- structure(getwords(st$terms), names = st$terms)
## named list of your search word vectors

tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
## the tweets + one additional that should match the last phrase

ans <- lapply(phrasewords, function(x)
    apply(sapply(tweets, function(y) x %in% y), 2, all))
## a list indexed by the search phrases,
## with each component a vector of logicals with vec[i] == TRUE iff
## the ith tweet contains all the words in the search phrase

> ans
$`me abused depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`me hurt depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel hopeless depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel alone depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel helpless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel worthless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
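
The original question asked for a single flag column rather than one result per phrase. A hedged sketch of that last step follows, on toy data rather than the real th and st. It also illustrates one caveat of splitting on spaces only: a word with punctuation attached ("worthless.") is not the same string as "worthless". getwords_clean is a hypothetical variant, not something from this thread, and note that stripping [[:punct:]] would also remove the "@" and "#" in handles and hashtags.

```r
## Same splitter as above: lower-case, split on runs of spaces.
getwords <- function(x) strsplit(tolower(x), split = " +")

## Hypothetical variant: replace punctuation with spaces before splitting.
getwords_clean <- function(x) {
  strsplit(gsub("[[:punct:]]+", " ", tolower(x)), split = " +")
}

## Toy data (assumed, not the real tweets).
phrases <- c("i feel helpless", "i feel worthless")
texts   <- c("dolly is so baddd",
             " i xxxx worthless yxxc ght feel",
             "I feel worthless.")

## TRUE for each text that contains all the words of at least one phrase.
flag_any <- function(texts, phrases, splitter) {
  phrasewords <- structure(splitter(phrases), names = phrases)
  tweets <- splitter(texts)
  ans <- lapply(phrasewords, function(x) {
    vapply(tweets, function(y) all(x %in% y), logical(1))
  })
  Reduce(`|`, ans)  # element-wise OR across the search phrases
}

flag       <- flag_any(texts, phrases, getwords)
flag_clean <- flag_any(texts, phrases, getwords_clean)
print(data.frame(text = texts, flag = flag, flag_clean = flag_clean))
```

The third text is flagged only by the punctuation-stripping variant. Using vapply(..., logical(1)) instead of apply(sapply(...), 2, all) also keeps the code working for a one-word search phrase, where sapply would return a vector rather than a matrix.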

-- Bert

On Wed, Oct 17, 2018 at 9:20 AM Nathan Parsons <[hidden email]>
wrote:

> I do not have your command of base r, Bert. That is a herculean effort!
> Here’s what I spent my night putting together:
>
> ## Create search terms
> ## dput(st)
> st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
> ), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
> ), word3 = c("lifts", "depressed", "depressed", "depressed",
> "depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
> -5L))
>
> ## Create tweets
> ## dput(th)
> th <- structure(list(status_id = c("x1047841705729306624",
> "x1046966595610927105",
> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
> "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
> "x1048269727582355457", "x1048092408544677890"), created_at =
> c("2018-10-04T13:31:45Z",
> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
> "2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
> ), text = c("technique is everything with olympic lifts ! @ body by john ",
> "@subtronics just went back and rewatched ur fblice with ur cdjs and let
> me tell you man. you are the fucking messiah",
> "@ic4rus1 opportunistic means short-game. as in getting drunk now vs. not
> being hung over tomorrow vs. not fucking up your life ten years later.",
> "i tend to think about my dreams before i sleep.", "@michaelavenatti
> @senatorcollins so if your client was in her 20s attending parties with
> teenagers doesnt that make her at the least immature as hell or at the
> worst a pedophile and a person contributing to the delinquency of minors?",
> "i wish i could take credit for this", "i woulda never imagined. #lakeshow
> ",
> "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
> "sunday ill have a booth in katy at the real craft wives of katy fest
> @nolabelbrewco cmon yall!everything is better when you top it with
> tias!order today we ship to all 50 ",
> "dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
> 40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
> 32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
> -100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
> -86.2447285), county_name = c("Cumberland County", "Delaware County",
> "San Francisco County", "Allegheny County", "Concho County",
> "Los Angeles County", "Los Angeles County", "Marathon County",
> "Harris County", "Montgomery County"), fips = c(23005L, 39041L,
> 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
> state_name = c("Maine", "Ohio", "California", "Pennsylvania",
> "Texas", "California", "California", "Wisconsin", "Texas",
> "Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
> "CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
> "NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
> "Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
> 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
> 184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
> 4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
> 10L), class = "data.frame")
>
> ## Clean tweets - basically just remove everything we don’t need from the
> text including punctuation and urls
> th %>%
> mutate(linenumber = row_number(),
> text = str_remove_all(text, "[^\x01-\x7F]"),
> text = str_remove_all(text, "\n"),
> text = str_remove_all(text, ","),
> text = str_remove_all(text, "'"),
> text = str_remove_all(text, "&"),
> text = str_remove_all(text, "<"),
> text = str_remove_all(text, ">"),
> text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
> text = tolower(text)) -> th
>
> ## Create search function that looks for each search term in the provided
> string, evaluates if all three search terms have been found, and returns a
> logical
> srchr <- function(df) {
> str_detect(df, "olympic") -> a
> str_detect(df, "technique") -> b
> str_detect(df, "lifts") -> c
> ifelse(a == TRUE & b == TRUE & c == TRUE, TRUE, FALSE)
> }
>
> ## Evaluate tweets for presence of search term
> th %>%
> mutate(flag = map_chr(text, srchr)) -> th_flagged
>
> As far as I can tell, this works. I have to manually enter each set of
> search terms into the function, which is not ideal. Also, this only
> generates a True/False for each tweet based on one search term - I end up
> with an evaluatory column for each search term that I would then have to
> collapse together somehow. I’m sure there’s a more elegant solution.
>
> --
>
> Nate Parsons
> Pronouns: He, Him, His
> Graduate Teaching Assistant
> Department of Sociology
> Portland State University
> Portland, Oregon
>
> 503-725-9025
> 503-725-3957 FAX
> On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <[hidden email]>,
> wrote:
>
> OK, as no one else has offered a solution, I'll take a whack at it.
>
> Caveats: This is a brute force attempt using R's basic regular expression
> engine. It is inelegant and barely tested, so likely to be at best
> incomplete and buggy, and at worst, incorrect. But maybe Nathan or someone
> else on the list can fix it up. So if (when) it breaks, complain on the
> list to give someone (almost certainly not me) the opportunity.
>
> The basic idea is that the tweets are just character strings and the
> search phrases are just character vectors all of whose elements must match
> "appropriately" -- i.e. they must match whole words -- in the character
> strings. So my desired output from the code is a list indexed by the search
> phrases, each of whose components if a logical vector of length the number
> of tweets each of whose elements = TRUE iff all the words in the search
> phrase match somewhere in the tweet.
>
> Here's the code(using the data Nathan provided):
>
> > words <- sapply(st[[1]],strsplit,split = " +" )
> ## convert the phrases to a list of character vectors of the words
> ## Result:
> > words
> $`me abused depressed`
> [1] "me"        "abused"    "depressed"
>
> $`me hurt depressed`
> [1] "me"        "hurt"      "depressed"
>
> $`feel hopeless depressed`
> [1] "feel"      "hopeless"  "depressed"
>
> $`feel alone depressed`
> [1] "feel"      "alone"     "depressed"
>
> $`i feel helpless`
> [1] "i"        "feel"     "helpless"
>
> $`i feel worthless`
> [1] "i"         "feel"      "worthless"
>
> > expand.words <-  function(z)lapply(z,function(x)paste0(c("^ *"," ","
> "),x, c(" "," "," *$")))
> ## function to create regexes for words when they are at the beginning,
> middle, or end of tweets
>
> > wordregex <- lapply(words,expand.words)
> ##Result
> ## too lengthy to include
> ##
> > tweets <- th$text
> ##extract the tweets
> > findin <- function(x,y)
>    ## x is a vector of regex patterns
>    ## y is a character vector
>    ## value = vector,vec, with length(vec) == length(y) and vec[i] == TRUE
> iff any of x matches y[i]
> { apply(sapply(x,function(z)grepl(z,y)), 1,any)
> }
>
> ## add a matching "tweet" to the tweet vector:
> > tweets <- c(tweets," i xxxx worthless yxxc ght feel")
>
> > ans <-
> lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1,
> all))
> ## Result:
> > ans
> $`me abused depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`me hurt depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel hopeless depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel alone depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel helpless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel worthless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>
> ## None of the tweets match any of the phrases except for the last tweet
> ## that I added.
>
> ## Note: you need to add capabilities to handle upper and lower case. See,
> ## e.g., ?casefold
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <[hidden email]>
> wrote:
>
>> The problem wasn't the data tibbles. You posted in html -- which you were
>> explicitly warned against -- and that corrupted your text (e.g. some quotes
>> became "smart quotes", which cannot be properly cut and pasted into R).
>>
>> Bert
>>
>>
>> On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <
>> [hidden email]> wrote:
>>
>>> Argh! Here are those two example datasets as data frames (not tibbles).
>>> Sorry again. This apparently is just not my day.
>>>
>>>
>>> th <- structure(list(status_id = c("x1047841705729306624", "x1046966595610927105",
>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>> "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c(
>>> "Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
>>> "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
>>> "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>>> "I tend to think about my dreams before I sleep.",
>>> "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
>>> "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
>>> 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>> -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>> ), county_name = c("Cumberland County", "Delaware County", "San Francisco County",
>>> "Allegheny County", "Concho County", "Los Angeles County"), fips = c(23005L,
>>> 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>> "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>     state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level = c("Medium Metro",
>>>     "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>     "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>     2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>     1160433L, 4160L, 9509611L)), class = "data.frame", row.names = c(NA,
>>> -6L))
>>>
>>>
>>> st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
>>> "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>>> "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>>> "tbl", "data.frame"))
>>>
>>>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter-2
All (especially Nathan): **Please feel free to ignore this post without
response.** It just represents a bit of OCD-ness on my part that may or may
not be of interest to anyone else.

Purpose of this post: To give an alternative solution to the problem that is
considerably simpler and considerably faster than those I offered
previously. It may or may not be what the OP asked for, but the improvement
exercise was instructive to me. Notation as previously in this thread.

New solution:

getwords <- function(x)
   strsplit(gsub("(^[[:space:]]+)|([[:space:]]+$)", "", tolower(x)), split = " +")
## split lower-cased text into a vector of "words"
## I made this a bit fancier to handle some "corner" cases, but the
## previous simpler version may well suffice.

'%allin%' <- function(x, table) prod(match(x, table, nomatch = 0L)) > 0L
## a convenience function/operator that improves efficiency.

## lists of search word vectors as before
phrasewords <- getwords(st$terms)
tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
## the tweets + one additional

## simpler approach just using indexing for the bookkeeping that nested
## _apply loops previously were used for
ans <- expand.grid(phrases = seq_along(phrasewords), tweets = seq_along(tweets), Result = FALSE)
ans$Result <- apply(ans, 1, function(r) phrasewords[[r[1]]] %allin% tweets[[r[2]]])

## ans is a data frame in which the first column indexes phrases and the
## second tweets.
## ans$Result[i] == TRUE iff all the words in the phrase indexed by the ith
## row of the phrases column are contained in the tweet indexed by that row's
## tweets column.
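
For anyone puzzling over %allin%: match() returns a position for each element of x that is found in table and, with nomatch = 0L, a 0 for any element that is missing, so the product is positive only when every word was found. Here is a minimal sketch on made-up vectors (phrase and tweet are illustrative names, not objects from the thread's data); it also checks that the operator agrees with the more familiar all(x %in% table).

```r
## Same operator as defined in the post above
`%allin%` <- function(x, table) prod(match(x, table, nomatch = 0L)) > 0L

## Illustrative word vectors (assumed, not taken from th or st)
phrase <- c("i", "feel", "worthless")
tweet  <- c("i", "xxxx", "worthless", "yxxc", "ght", "feel")

## Every word of the phrase occurs in the tweet, in any order
hit <- phrase %allin% tweet

## "helpless" is absent, so match() yields a 0 and the product is 0
miss <- c("i", "feel", "helpless") %allin% tweet

## all(x %in% table) should give the same answers
same <- identical(hit, all(phrase %in% tweet)) &&
  identical(miss, all(c("i", "feel", "helpless") %in% tweet))
```

Whether the prod(match()) form is noticeably quicker than all(x %in% table) on real data is something to time rather than assume.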

This was way faster than my previous offerings.

Note also that just the matching phrases and tweets can be extracted as
usual by:

> ans[ans[,3],]
   phrases tweets Result
42       6      7   TRUE
## all the words in the 6th search phrase appeared in the 7th tweet.
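
Nathan's original request was a flag on each tweet rather than a phrase-by-tweet table. One possible way to get there from a data frame shaped like ans is to collapse Result over phrases with tapply(). The sketch below is self-contained and runs on a toy example with two tweets and two phrases. The names toy_tweets, toy_phrases, pw, tw, res and flag are invented for illustration, and getwords() here is a simplified stand-in that uses trimws().

```r
## Simplified stand-ins for the helpers in the post (assumptions for this sketch)
getwords  <- function(x) strsplit(trimws(tolower(x)), split = " +")
`%allin%` <- function(x, table) prod(match(x, table, nomatch = 0L)) > 0L

## Toy data, not the thread's th and st
toy_tweets <- data.frame(
  text = c("I feel so worthless today", "i wish i could take credit for this"),
  stringsAsFactors = FALSE
)
toy_phrases <- c("i feel worthless", "me hurt depressed")

pw <- getwords(toy_phrases)      # list of search-word vectors
tw <- getwords(toy_tweets$text)  # list of tweet-word vectors

## One row per (phrase, tweet) pair, as in ans
res <- expand.grid(phrases = seq_along(pw), tweets = seq_along(tw))
res$Result <- mapply(function(p, t) pw[[p]] %allin% tw[[t]],
                     res$phrases, res$tweets)

## A tweet is flagged if any phrase had all of its words in that tweet
toy_tweets$flag <- as.vector(tapply(res$Result, res$tweets, any))
```

Because the tweets column holds integer indices, tapply() returns its groups in index order, so the result lines up with the rows of the tweet data frame.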

** I promise to natter on about this no longer! **

Cheers,
Bert


On Wed, Oct 17, 2018 at 7:50 PM Bert Gunter <[hidden email]> wrote:

>
> If you wish to use R, you need to at least understand its basic data
> structures and functionality. Expecting that mimicry of code in special
> packages will suffice is, I believe, an illusion. If you haven't already
> done so, you should go through a basic R tutorial or two (there are many on
> the web; some recommendations, by no means necessarily "the best", can be
> found here:
> https://www.rstudio.com/online-learning/#r-programming).
>
> Having said that, I realized that my previous "solution" using regular
> expressions was more complicated than it needed to be and somewhat foolish
> (so much for all my "expertise"). A simpler and better approach is simply
> to break up both the tweet texts and your search phrases into vectors of
> their "words" (i.e. character strings surrounded by spaces) using
> strsplit(), and then to use R's built-in matching capabilities with %in%.
> This is quite straightforward, pretty robust (no regexes to wrestle with),
> and does not require "herculean efforts" to understand. The only wrinkle is
> some bookkeeping with the "apply" family of functions. These are, as you
> may know, the functional programming way of handling iteration (loops), but
> they are what I would consider part of "basic" R functionality and worth
> spending the time to learn about.
>
> Herewith my better, simpler proposal, using your example data as before:
>
> getwords <- function(x)strsplit(tolower(x),split = " +")
> ## split text into a vector of lower-cased "words"
>
> phrasewords <- structure(getwords(st$terms), names = st$terms)
> ## named list of your search word vectors
>
> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
> ## the tweets + one additional that should match the last phrase
>
> ans <- lapply(phrasewords, function(x) apply(sapply(tweets,function(y)x
> %in% y), 2, all))
> ## a list indexed by the search phrases,
> ## with each component a vector of logicals with vec[i] == TRUE iff
> ## the ith tweet contains all the words in the search phrase
>
> > ans
> $`me abused depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`me hurt depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel hopeless depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`feel alone depressed`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel helpless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>
> $`i feel worthless`
> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>
> -- Bert
>
> On Wed, Oct 17, 2018 at 9:20 AM Nathan Parsons <[hidden email]>
> wrote:
>
>> I do not have your command of base r, Bert. That is a herculean effort!
>> Here’s what I spent my night putting together:
>>
>> ## Create search terms
>> ## dput(st)
>> st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
>> ), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
>> ), word3 = c("lifts", "depressed", "depressed", "depressed",
>> "depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
>> -5L))
>>
>> ## Create tweets
>> ## dput(th)
>> th <- structure(list(status_id = c("x1047841705729306624",
>> "x1046966595610927105",
>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>> "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
>> "x1048269727582355457", "x1048092408544677890"), created_at =
>> c("2018-10-04T13:31:45Z",
>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
>> "2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
>> ), text = c("technique is everything with olympic lifts ! @ body by john
>> ",
>> "@subtronics just went back and rewatched ur fblice with ur cdjs and let
>> me tell you man. you are the fucking messiah",
>> "@ic4rus1 opportunistic means short-game. as in getting drunk now vs. not
>> being hung over tomorrow vs. not fucking up your life ten years later.",
>> "i tend to think about my dreams before i sleep.", "@michaelavenatti
>> @senatorcollins so if your client was in her 20s attending parties with
>> teenagers doesnt that make her at the least immature as hell or at the
>> worst a pedophile and a person contributing to the delinquency of minors?",
>> "i wish i could take credit for this", "i woulda never imagined.
>> #lakeshow ",
>> "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
>> "sunday ill have a booth in katy at the real craft wives of katy fest
>> @nolabelbrewco cmon yall!everything is better when you top it with
>> tias!order today we ship to all 50 ",
>> "dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
>> 40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
>> 32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
>> -100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
>> -86.2447285), county_name = c("Cumberland County", "Delaware County",
>> "San Francisco County", "Allegheny County", "Concho County",
>> "Los Angeles County", "Los Angeles County", "Marathon County",
>> "Harris County", "Montgomery County"), fips = c(23005L, 39041L,
>> 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
>> state_name = c("Maine", "Ohio", "California", "Pennsylvania",
>> "Texas", "California", "California", "Wisconsin", "Texas",
>> "Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
>> "CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
>> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>> "NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
>> "Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
>> 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
>> 184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
>> 4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
>> 10L), class = "data.frame")
>>
>> ## Clean tweets - basically just remove everything we don’t need from the
>> text including punctuation and urls
>> th %>%
>> mutate(linenumber = row_number(),
>> text = str_remove_all(text, "[^\x01-\x7F]"),
>> text = str_remove_all(text, "\n"),
>> text = str_remove_all(text, ","),
>> text = str_remove_all(text, "'"),
>> text = str_remove_all(text, "&"),
>> text = str_remove_all(text, "<"),
>> text = str_remove_all(text, ">"),
>> text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
>> text = tolower(text)) -> th
>>
>> ## Create search function that looks for each search term in the provided
>> string, evaluates if all three search terms have been found, and returns a
>> logical
>> srchr <- function(df) {
>> str_detect(df, "olympic") -> a
>> str_detect(df, "technique") -> b
>> str_detect(df, "lifts") -> c
>> ifelse(a == TRUE & b == TRUE & c == TRUE, TRUE, FALSE)
>> }
>>
>> ## Evaluate tweets for presence of search term
>> th %>%
>> mutate(flag = map_lgl(text, srchr)) -> th_flagged
>>
>> As far as I can tell, this works. I have to manually enter each set of
>> search terms into the function, which is not ideal. Also, this only
>> generates a True/False for each tweet based on one search term - I end up
>> with an evaluatory column for each search term that I would then have to
>> collapse together somehow. I’m sure there’s a more elegant solution.
>>
>> --
>>
>> Nate Parsons
>> Pronouns: He, Him, His
>> Graduate Teaching Assistant
>> Department of Sociology
>> Portland State University
>> Portland, Oregon
>>
>> 503-725-9025
>> 503-725-3957 FAX

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Bert Gunter-2
Sorry. Typo.  The last line should be:

ans$Result <- apply(ans,1,function(r)phrasewords[[r[1]]] %allin%
tweets[[r[2]]])

-- Bert
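
For anyone who wants to run the corrected indexing approach end to end, here is a minimal sketch in base R. The two search terms and two tweets are made-up stand-ins for st$terms and th$text, and it swaps in all(x %in% table) and mapply() where the post uses prod(match()) and apply(); those substitutions are assumptions for illustration, not the code from the post.

```r
## toy stand-ins for st$terms and th$text (hypothetical data)
terms <- c("i feel worthless", "me hurt depressed")
texts <- c("Technique is everything with olympic lifts",
           " i xxxx worthless yxxc ght feel")

## trim, lower-case, and split on runs of spaces
getwords <- function(x) strsplit(trimws(tolower(x)), split = " +")

## TRUE iff every element of x occurs in table
`%allin%` <- function(x, table) all(x %in% table)

phrasewords <- getwords(terms)
tweets <- getwords(texts)

## one row per (phrase, tweet) pair; phrases vary fastest
ans <- expand.grid(phrases = seq_along(phrasewords),
                   tweets = seq_along(tweets))

## look up each pair by index and test it
ans$Result <- mapply(function(p, t) phrasewords[[p]] %allin% tweets[[t]],
                     ans$phrases, ans$tweets)

## keep only the matching pairs
ans[ans$Result, ]
```

Only the pair (phrase 1, tweet 2) should come back as a match here, since word order in the tweet does not matter.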



On Thu, Oct 18, 2018 at 7:04 PM Bert Gunter <[hidden email]> wrote:

> All (especially Nathan): **Please feel free to ignore this post without
> response.** It just represents a bit of OCD-ness on my part that may or may
> not be of interest to anyone else.
>
> Purpose of this post: To give an alternative considerably simpler and
> considerably faster solution to the problem than those which I offered
> previously. It may or may not be what the OP asked for, but the improvement
> exercise was instructive to me. Notation as previously in this thread.
>
> New solution:
>
> getwords <- function(x)
>   strsplit(gsub("(^[[:space:]]+)|([[:space:]]+$)", "", tolower(x)), split = " +")
> ## strip leading/trailing whitespace, then split lower-cased text into a
> ## vector of "words"
> ## I made this a bit fancier to handle some "corner" cases, but the
> ## previous simpler version may well suffice.
>
> '%allin%' <- function(x, table)prod(match(x,table, nomatch = 0L)) > 0L
> ## a convenience function/operator that improves efficiency.
>
> ## lists of  search word vectors as before
> phrasewords <- getwords(st$terms)
> ## the tweets + one additional
> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
>
> ## simpler approach just using indexing for the bookkeeping that nested
> ## _apply loops previously were used for
> ans <- expand.grid(phrases = seq_along(phrasewords),tweets =
> seq_along(tweets), Result = FALSE)
> ans$Result <- apply(ind,1,function(r)phrasewords[[r[1]]] %allin%
> tweets[[r[2]]])
>
> ## ans is a data frame in which the first column indexes phrases and the
> ## second tweets.
> ## The ith row of ans$Result == TRUE iff all the words in the phrase
> ## indexed by the ith row of the phrases column are contained in the
> ## tweet indexed by that row's tweets column.
>
> This was way faster than my previous offerings.
>
> Note also that just the matching phrases and tweets can be extracted as
> usual by:
>
> > ans[ans[,3],]
>    phrases tweets Result
> 42       6      7   TRUE
> ## all the words in the 6th search phrase appeared in the 7th tweet.
>
> ** I promise to natter on about this no longer! **
>
> Cheers,
> Bert
>
>
> On Wed, Oct 17, 2018 at 7:50 PM Bert Gunter <[hidden email]>
> wrote:
>
>>
>> If you wish to use R, you need to at least understand its basic data
>> structures and functionality. Expecting that mimicry of code in special
>> packages will suffice is, I believe, an illusion. If you haven't already
>> done so, you should go through a basic R tutorial or two (there are many on
>> the web; some recommendations, by no means necessarily "the best",  can be
>> found here:
>> https://www.rstudio.com/online-learning/#r-programming).
>>
>> Having said that, I realized that my previous "solution" using regular
>> expressions was more complicated than it needed to be and somewhat foolish
>> ( so much for all my "expertise"). A simpler and better approach is simply
>> to break up both the tweet texts and your search phrases into vectors of
>> their "words" (i.e. character strings surrounded by spaces) using
>> strsplit(), and then using R's built-in matching capabilities with %in%.
>> This is quite straightforward, pretty robust (no regex's to wrestle with),
>> and does not require "herculean efforts" to understand. The only wrinkle is
>> some bookkeeping with the "apply" family of functions. These are, as you
>> may know, the functional programming way of handling iteration (loops), but
>> they are what I would consider part of "basic" R functionality and worth
>> spending the time to learn about.
>>
>> Herewith my better, simpler proposal, using your example data as before:
>>
>> getwords <- function(x)strsplit(tolower(x),split = " +")
>> ## split text into a vector of lower-cased "words"
>>
>> phrasewords <- structure(getwords(st$terms), names = st$terms)
>> ## named list of your search word vectors
>>
>> tweets <- getwords(c(th$text, " i xxxx worthless yxxc ght feel"))
>> ## the tweets + one additional that should match the last phrase
>>
>> ans <- lapply(phrasewords, function(x) apply(sapply(tweets,function(y)x
>> %in% y), 2, all))
>> ## a list indexed by the search phrases,
>> ## with each component a vector of logicals with vec[i] == TRUE iff
>> ## the ith tweet contains all the words in the search phrase
>>
>> > ans
>> $`me abused depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`me hurt depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel hopeless depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`feel alone depressed`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel helpless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>
>> $`i feel worthless`
>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>>
>> -- Bert
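
The original question asked for a single flag column on the tweet data rather than a list indexed by phrase. One way to get there from a list like `ans` is to OR its components together. The sketch below assumes that goal; the two-row `th` and the two terms are invented for illustration, and vapply() with all() stands in for the apply(sapply(...)) combination above.

```r
getwords <- function(x) strsplit(tolower(x), split = " +")

## hypothetical search terms and tweets
terms <- c("i feel helpless", "i feel worthless")
th <- data.frame(text = c("i wish i could take credit for this",
                          "some days i just feel worthless"),
                 stringsAsFactors = FALSE)

phrasewords <- structure(getwords(terms), names = terms)
tweets <- getwords(th$text)

## list indexed by phrase; each component has one logical per tweet
ans <- lapply(phrasewords, function(x)
  vapply(tweets, function(y) all(x %in% y), logical(1)))

## a tweet is flagged if it matches at least one phrase
th$flag <- Reduce(`|`, ans)
th
```

Reduce() with `|` works element-wise across the per-phrase vectors, so the result has one entry per tweet and can be assigned straight into the data frame.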
>>
>> On Wed, Oct 17, 2018 at 9:20 AM Nathan Parsons <
>> [hidden email]> wrote:
>>
>>> I do not have your command of base r, Bert. That is a herculean effort!
>>> Here’s what I spent my night putting together:
>>>
>>> ## Create search terms
>>> ## dput(st)
>>> st <- structure(list(word1 = c("technique", "me", "me", "feel", "feel"
>>> ), word2 = c("olympic", "abused", "hurt", "hopeless", "alone"
>>> ), word3 = c("lifts", "depressed", "depressed", "depressed",
>>> "depressed")), class = c("tbl_df", "tbl", "data.frame"), row.names =
>>> c(NA,
>>> -5L))
>>>
>>> ## Create tweets
>>> ## dput(th)
>>> th <- structure(list(status_id = c("x1047841705729306624",
>>> "x1046966595610927105",
>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>> "x1047227442899775488", "x1048126008941981696", "x1047798782673543173",
>>> "x1048269727582355457", "x1048092408544677890"), created_at =
>>> c("2018-10-04T13:31:45Z",
>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z", "2018-10-05T08:21:28Z",
>>> "2018-10-04T10:41:11Z", "2018-10-05T17:52:33Z", "2018-10-05T06:07:57Z"
>>> ), text = c("technique is everything with olympic lifts ! @ body by john
>>> ",
>>> "@subtronics just went back and rewatched ur fblice with ur cdjs and let
>>> me tell you man. you are the fucking messiah",
>>> "@ic4rus1 opportunistic means short-game. as in getting drunk now vs.
>>> not being hung over tomorrow vs. not fucking up your life ten years later.",
>>> "i tend to think about my dreams before i sleep.", "@michaelavenatti
>>> @senatorcollins so if your client was in her 20s attending parties with
>>> teenagers doesnt that make her at the least immature as hell or at the
>>> worst a pedophile and a person contributing to the delinquency of minors?",
>>> "i wish i could take credit for this", "i woulda never imagined.
>>> #lakeshow ",
>>> "@philipbloom @blackmagic_news its ok phil! i feel your pain! ",
>>> "sunday ill have a booth in katy at the real craft wives of katy fest
>>> @nolabelbrewco cmon yall!everything is better when you top it with
>>> tias!order today we ship to all 50 ",
>>> "dolly is so baddd"), lat = c(43.6835853, 40.284123, 37.7706565,
>>> 40.431389, 31.1688935, 33.9376735, 34.0207895, 44.900818, 29.7926,
>>> 32.364145), lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895,
>>> -100.0768885, -118.130426, -118.4119065, -89.5694915, -95.8224,
>>> -86.2447285), county_name = c("Cumberland County", "Delaware County",
>>> "San Francisco County", "Allegheny County", "Concho County",
>>> "Los Angeles County", "Los Angeles County", "Marathon County",
>>> "Harris County", "Montgomery County"), fips = c(23005L, 39041L,
>>> 6075L, 42003L, 48095L, 6037L, 6037L, 55073L, 48201L, 1101L),
>>> state_name = c("Maine", "Ohio", "California", "Pennsylvania",
>>> "Texas", "California", "California", "Wisconsin", "Texas",
>>> "Alabama"), state_abb = c("ME", "OH", "CA", "PA", "TX", "CA",
>>> "CA", "WI", "TX", "AL"), urban_level = c("Medium Metro",
>>> "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>> "NonCore (Nonmetro)", "Large Central Metro", "Large Central Metro",
>>> "Small Metro", "Large Central Metro", "Medium Metro"), urban_code = c(3L,
>>> 2L, 1L, 1L, 6L, 1L, 1L, 4L, 1L, 3L), population = c(277308L,
>>> 184029L, 830781L, 1160433L, 4160L, 9509611L, 9509611L, 127612L,
>>> 4233913L, 211037L), linenumber = 1:10), row.names = c(NA,
>>> 10L), class = "data.frame")
>>>
>>> ## Clean tweets - basically just remove everything we don’t need from
>>> ## the text including punctuation and urls
>>> th %>%
>>> mutate(linenumber = row_number(),
>>> text = str_remove_all(text, "[^\x01-\x7F]"),
>>> text = str_remove_all(text, "\n"),
>>> text = str_remove_all(text, ","),
>>> text = str_remove_all(text, "'"),
>>> text = str_remove_all(text, "&"),
>>> text = str_remove_all(text, "<"),
>>> text = str_remove_all(text, ">"),
>>> text = str_remove_all(text, "http[s]?://[[:alnum:].\\/]+"),
>>> text = tolower(text)) -> th
>>>
>>> ## Create search function that looks for each search term in the
>>> ## provided string, evaluates if all three search terms have been found,
>>> ## and returns a logical
>>> srchr <- function(df) {
>>> str_detect(df, "olympic") -> a
>>> str_detect(df, "technique") -> b
>>> str_detect(df, "lifts") -> c
>>> a & b & c  ## TRUE only when all three words are found
>>> }
>>>
>>> ## Evaluate tweets for presence of search term
>>> th %>%
>>> mutate(flag = map_lgl(text, srchr)) -> th_flagged
>>>
>>> As far as I can tell, this works. I have to manually enter each set of
>>> search terms into the function, which is not ideal. Also, this only
>>> generates a True/False for each tweet based on one search term - I end up
>>> with an evaluatory column for each search term that I would then have to
>>> collapse together somehow. I’m sure there’s a more elegant solution.
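
One way around hard-coding the words is to pass them in as an argument and loop over the rows of the search-term table. The sketch below is only an illustration under assumptions: `contains_all` is a hypothetical helper, the two-row `st` and the two texts are cut-down stand-ins for the data above, and base grepl(fixed = TRUE) is used in place of stringr::str_detect() so that it runs without extra packages.

```r
## hypothetical helper: TRUE if `text` contains every element of `words`
## (substring match, like str_detect with a plain pattern)
contains_all <- function(text, words) {
  all(vapply(words, function(w) grepl(w, text, fixed = TRUE), logical(1)))
}

## cut-down stand-ins for the st and th data
st <- data.frame(word1 = c("technique", "me"),
                 word2 = c("olympic", "abused"),
                 word3 = c("lifts", "depressed"),
                 stringsAsFactors = FALSE)
texts <- c("technique is everything with olympic lifts ! @ body by john ",
           "i wish i could take credit for this")

## one character vector of search words per row of st
term_words <- lapply(seq_len(nrow(st)),
                     function(i) unlist(st[i, ], use.names = FALSE))

## flag a text if any row of st is fully contained in it
flag <- vapply(texts, function(tx)
  any(vapply(term_words, contains_all, logical(1), text = tx)),
  logical(1), USE.NAMES = FALSE)
flag
```

Because this is substring matching, a short word such as "me" would also be found inside "some"; splitting the text into words first avoids that.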
>>>
>>> --
>>>
>>> Nate Parsons
>>> Pronouns: He, Him, His
>>> Graduate Teaching Assistant
>>> Department of Sociology
>>> Portland State University
>>> Portland, Oregon
>>>
>>> 503-725-9025
>>> 503-725-3957 FAX
>>> On Oct 16, 2018, 7:20 PM -0700, Bert Gunter <[hidden email]>,
>>> wrote:
>>>
>>> OK, as no one else has offered a solution, I'll take a whack at it.
>>>
>>> Caveats: This is a brute force attempt using R's basic regular
>>> expression engine. It is inelegant and barely tested, so likely to be at
>>> best incomplete and buggy, and at worst, incorrect. But maybe Nathan or
>>> someone else on the list can fix it up. So if (when) it breaks, complain on
>>> the list to give someone (almost certainly not me) the opportunity.
>>>
>>> The basic idea is that the tweets are just character strings and the
>>> search phrases are just character vectors all of whose elements must match
>>> "appropriately" -- i.e. they must match whole words -- in the character
>>> strings. So my desired output from the code is a list indexed by the search
>>> phrases, each of whose components is a logical vector of length the number
>>> of tweets each of whose elements = TRUE iff all the words in the search
>>> phrase match somewhere in the tweet.
>>>
>>> Here's the code(using the data Nathan provided):
>>>
>>> > words <- sapply(st[[1]],strsplit,split = " +" )
>>> ## convert the phrases to a list of character vectors of the words
>>> ## Result:
>>> > words
>>> $`me abused depressed`
>>> [1] "me"        "abused"    "depressed"
>>>
>>> $`me hurt depressed`
>>> [1] "me"        "hurt"      "depressed"
>>>
>>> $`feel hopeless depressed`
>>> [1] "feel"      "hopeless"  "depressed"
>>>
>>> $`feel alone depressed`
>>> [1] "feel"      "alone"     "depressed"
>>>
>>> $`i feel helpless`
>>> [1] "i"        "feel"     "helpless"
>>>
>>> $`i feel worthless`
>>> [1] "i"         "feel"      "worthless"
>>>
>>> > expand.words <- function(z)lapply(z,function(x)paste0(c("^ *"," "," "),x, c(" "," "," *$")))
>>> ## function to create regexes for words when they are at the beginning,
>>> ## middle, or end of tweets
>>>
>>> > wordregex <- lapply(words,expand.words)
>>> ##Result
>>> ## too lengthy to include
>>> ##
>>> > tweets <- th$text
>>> ##extract the tweets
>>> > findin <- function(x,y)
>>>    ## x is a vector of regex patterns
>>>    ## y is a character vector
>>>    ## value = vector,vec, with length(vec) == length(y) and vec[i] ==
>>>    ## TRUE iff any of x matches y[i]
>>> { apply(sapply(x,function(z)grepl(z,y)), 1,any)
>>> }
>>>
>>> ## add a matching "tweet" to the tweet vector:
>>> > tweets <- c(tweets," i xxxx worthless yxxc ght feel")
>>>
>>> > ans <-
>>> lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1,
>>> all))
>>> ## Result:
>>> > ans
>>> $`me abused depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`me hurt depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel hopeless depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`feel alone depressed`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel helpless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>
>>> $`i feel worthless`
>>> [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
>>>
>>> ## None of the tweets match any of the phrases except for the last tweet
>>> ## that I added.
>>>
>>> ## Note: you need to add capabilities to handle upper and lower case.
>>> ## See, e.g. ?casefold
>>>
>>> Cheers,
>>> Bert
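
On the upper/lower case note: a possible variation on the regex idea, sketched here as an assumption rather than as the code above, is to let the regex engine find whole words with \\b and ignore case via ignore.case = TRUE. `has_all_words` and the sample strings are hypothetical, and words containing regex metacharacters would need escaping first.

```r
## hypothetical helper: for each element of `texts`, TRUE iff every word
## in `words` occurs as a whole word, ignoring case
has_all_words <- function(words, texts) {
  hits <- lapply(words, function(w)
    grepl(paste0("\\b", w, "\\b"), texts, ignore.case = TRUE, perl = TRUE))
  ## combine the per-word results element-wise
  Reduce(`&`, hits)
}

phrase <- c("i", "feel", "worthless")
tweets <- c("I FEEL so worthless today", "feeling worthless")
res <- has_all_words(phrase, tweets)
res
```

The second string is rejected because "feel" and "i" only occur inside "feeling", not as words of their own.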
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <[hidden email]>
>>> wrote:
>>>
>>>> The problem wasn't the data tibbles. You posted in html -- which you
>>>> were explicitly warned against -- and that corrupted your text (e.g. some
>>>> quotes became "smart quotes", which cannot be properly cut and pasted into
>>>> R).
>>>>
>>>> Bert
>>>>
>>>>
>>>> On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <
>>>> [hidden email]> wrote:
>>>>
>>>>> Argh! Here are those two example datasets as data frames (not tibbles).
>>>>> Sorry again. This apparently is just not my day.
>>>>>
>>>>>
>>>>> th <- structure(list(status_id = c("x1047841705729306624",
>>>>> "x1046966595610927105",
>>>>>
>>>>> "x1047094786610552832", "x1046988542818308097", "x1046934493553221632",
>>>>>
>>>>> "x1047227442899775488"), created_at = c("2018-10-04T13:31:45Z",
>>>>>
>>>>> "2018-10-02T03:34:22Z", "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>>>>>
>>>>> "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"), text = c("Technique is
>>>>> everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt
>>>>> ",
>>>>>
>>>>> "@Subtronics just went back and rewatched ur FBlice with ur CDJs and
>>>>> let me
>>>>> tell you man. You are the fucking messiah",
>>>>>
>>>>> "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs.
>>>>> not
>>>>> being hung over tomorrow vs. not fucking up your life ten years
>>>>> later.",
>>>>>
>>>>> "I tend to think about my dreams before I sleep.", "@MichaelAvenatti
>>>>> @SenatorCollins So,  if your client was in her 20s, attending parties
>>>>> with
>>>>> teenagers, doesn't that make her at the least immature as hell, or at
>>>>> the
>>>>> worst, a pedophile and a person contributing to the delinquency of
>>>>> minors?",
>>>>>
>>>>>
>>>>> "i wish i could take credit for this"), lat = c(43.6835853, 40.284123,
>>>>>
>>>>> 37.7706565, 40.431389, 31.1688935, 33.9376735), lng = c(-70.3284118,
>>>>>
>>>>> -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426
>>>>>
>>>>> ), county_name = c("Cumberland County", "Delaware County", "San
>>>>> Francisco
>>>>> County",
>>>>>
>>>>> "Allegheny County", "Concho County", "Los Angeles County"), fips =
>>>>> c(23005L,
>>>>>
>>>>>
>>>>> 39041L, 6075L, 42003L, 48095L, 6037L), state_name = c("Maine",
>>>>>
>>>>> "Ohio", "California", "Pennsylvania", "Texas", "California"),
>>>>>
>>>>>     state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"), urban_level =
>>>>> c("Medium Metro",
>>>>>
>>>>>     "Large Fringe Metro", "Large Central Metro", "Large Central Metro",
>>>>>
>>>>>     "NonCore (Nonmetro)", "Large Central Metro"), urban_code = c(3L,
>>>>>
>>>>>     2L, 1L, 1L, 6L, 1L), population = c(277308L, 184029L, 830781L,
>>>>>
>>>>>     1160433L, 4160L, 9509611L)), class = "data.frame", row.names =
>>>>> c(NA,
>>>>>
>>>>> -6L))
>>>>>
>>>>>
>>>>> st <- structure(list(terms = c("me abused depressed", "me hurt
>>>>> depressed",
>>>>>
>>>>> "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>>>>>
>>>>> "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>>>>>
>>>>> "tbl", "data.frame"))
>>>>>

Re: Matching multiple search criteria (Unlisting a nested dataset, take 2)

Ista Zahn
In reply to this post by Nathan Parsons
Here is another approach, just for fun:

library(tidyverse)
library(tokenizers)

anyall <- function(x, # a character vector
                   terms # a list of character vectors
                   ){
    any(map_lgl(terms, function(term) {
        all(term %in% x)
    }))
}

mutate(th,
       flag = map_lgl(tokenize_tweets(text),
                      anyall,
                      terms = tokenize_words(st$terms)))

Best,
Ista
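
If the tokenizers package (or its tokenize_tweets() function) is not available in a given installation, the same any/all logic can be tried in base R. The sketch below is an approximation under assumptions: `tokenize_simple` is a hypothetical stand-in that lower-cases and splits on anything other than letters, digits, @, #, _ and ', which is cruder than a real tweet tokenizer, and the data are invented.

```r
## hypothetical stand-in tokenizer: lower-case, split on anything that is
## not a letter, digit, or one of @ # _ ', and drop empty pieces
tokenize_simple <- function(x) {
  lapply(strsplit(tolower(x), "[^[:alnum:]@#_']+"),
         function(w) w[nzchar(w)])
}

## TRUE if all words of at least one term occur in x
anyall <- function(x, terms) {
  any(vapply(terms, function(term) all(term %in% x), logical(1)))
}

## invented example data
terms <- c("i feel helpless", "i feel worthless")
th <- data.frame(text = c("I tend to think about my dreams before I sleep.",
                          "Honestly, I feel so worthless today"),
                 stringsAsFactors = FALSE)

th$flag <- vapply(tokenize_simple(th$text), anyall, logical(1),
                  terms = tokenize_simple(terms))
th
```

Splitting on punctuation means "Honestly," and "sleep." are matched as plain words, which a split on spaces alone would miss.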
