POS tagging generating a string

R help mailing list-2
Hi all,

In my df I would like to generate a new column which contains a string showing all the verbs in each row of df$Message.



> library(openNLP)
> library(NLP)
> dput(df)
structure(list(DocumentID = c(478920L, 510133L, 499497L, 930234L
), Message = structure(c(4L, 2L, 3L, 1L), .Label = c("Thank you very much for your nice feedback.\n",
"THank you, added it", "Thanks for the well explained article.",
"The solution has been updated"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))

tagPOS <-  function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
list(POStagged = POStagged, POStags = POStags)
}



Any help?
Thanks in advance!
Elahe

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: POS tagging generating a string

R help mailing list-2
Hi Elahe,
You could modify your count_verbs function from your previous post:

  * use scan to extract the tokens (words) from Message
  * use your previous grepl expression to index the tokens that are verbs
  * paste the verbs together to form the entries of a new column.

Here is one solution:

 >>>>>>>>>>>>>>>
library(openNLP)
library(NLP)

df <- data.frame(DocumentID = c(478920L, 510133L, 499497L, 930234L),
                  Message = structure(c(4L, 2L, 3L, 1L), .Label =
c("Thank you very much for your nice feedback.\n",
"THank you, added it", "Thanks for the well explained article.",
"The solution has been updated"), class = "factor"))


dput(df)

tagPOS <-  function(x, ...) {
   s <- as.String(x)
   if(s=="") return(list())
   word_token_annotator <- Maxent_Word_Token_Annotator()
   a2 <- Annotation(1L, "sentence", 1L, nchar(s))
   a2 <- annotate(s, word_token_annotator, a2)
   a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
   a3w <- a3[a3$type == "word"]
   POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
   POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
   list(POStagged = POStagged, POStags = POStags)
}

verbs <-function(x) {
   tagPOSx <- tagPOS(x)
   scanx <- scan(text=as.character(x), what="character")
   n <- length(scanx)
   paste(scanx[(1:n)[grepl("VB", tagPOSx$POStags)]], collapse="|")
}

library(dplyr)

df %>% group_by(DocumentID) %>% summarise(verbs = verbs(Message))
<<<<<<<<<<<<<<<<<<<<<

I'll leave it to you to extract the column of verbs from the result and
join it to the original data.frame by DocumentID.
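
One way to attach the verbs to the original data.frame is sketched below. It assumes the summarised result is sorted by DocumentID (as group_by/summarise output normally is), so the rows are matched on DocumentID rather than bound by position. The `res` data.frame here is typed in by hand from the output reported in this thread, and the `Message` column is stored as character for brevity.

```r
# Original data, in its original row order.
df <- data.frame(
  DocumentID = c(478920L, 510133L, 499497L, 930234L),
  Message = c("The solution has been updated",
              "THank you, added it",
              "Thanks for the well explained article.",
              "Thank you very much for your nice feedback.\n"),
  stringsAsFactors = FALSE
)

# Stand-in for the summarise() result: one row per DocumentID, sorted by key.
res <- data.frame(
  DocumentID = c(478920L, 499497L, 510133L, 930234L),
  verbs = c("has|been|updated", "explained", "it", "Thank"),
  stringsAsFactors = FALSE
)

# match() gives, for each row of df, the position of its DocumentID in res,
# so the verbs land on the right rows even though the two orders differ.
df$verbs <- res$verbs[match(df$DocumentID, res$DocumentID)]
```

Binding by position (cbind) would pair rows 2 and 3 with each other's verbs here, because the two data.frames are ordered differently.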

Btw, I don't think this solution is efficient. I would guess that the
processing that scan does in the verbs function duplicates work
already done in the tagPOS function by annotate, so you may want to
return a list of tokens from tagPOS and use that instead of scan.
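
A minimal sketch of that suggestion follows; it is not tested against every input. The names `tag_tokens`, `collapse_verbs` and `verbs_from_message` are made up for illustration. The NLP/openNLP calls are the same ones used in tagPOS above, written with `::` so the functions can be defined before the packages are attached.

```r
# Tag one message and return the tagger's own tokens alongside their POS tags.
tag_tokens <- function(x) {
  s <- NLP::as.String(as.character(x))
  if (nchar(s) == 0L) {
    return(list(tokens = character(0), POStags = character(0)))
  }
  a2 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a2)
  a3 <- NLP::annotate(s, openNLP::Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  list(tokens = s[a3w],
       POStags = unlist(lapply(a3w$features, `[[`, "POS")))
}

# Pure helper: keep the tokens whose tag starts with "VB" and paste them.
collapse_verbs <- function(tokens, tags, sep = "|") {
  stopifnot(length(tokens) == length(tags))
  paste(tokens[grepl("^VB", tags)], collapse = sep)
}

# Tokens and tags come from the same annotation, so no second tokenizer.
verbs_from_message <- function(x) {
  tt <- tag_tokens(x)
  collapse_verbs(tt$tokens, tt$POStags)
}

# Usage (needs openNLP, its model data and a working Java setup):
# df$verbs <- vapply(as.character(df$Message), verbs_from_message,
#                    character(1), USE.NAMES = FALSE)
```

Because the tokens and the tags have the same length by construction, the index arithmetic on `1:n` is no longer needed.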

Rgds,
Robert

Re: POS tagging generating a string

R help mailing list-2
In reply to this post by R help mailing list-2
On 13/11/2018 12:31, Elahe chalabi wrote:

> Hi Robert,
>
> Thanks for your reply but your code returns the number of verbs in each message. What I want is a string showing verbs in each message.
>
The output of the code from my earlier reply is:

# A tibble: 4 x 2
   DocumentID verbs
        <int> <chr>
1     478920 has|been|updated
2     499497 explained
3     510133 it
4     930234 Thank

Is this not what you wanted?
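
A note on the "it" in row 3: a likely explanation is that scan splits on whitespace only, while the tagger emits punctuation as a separate token, so the two token lists can differ in length and the verb positions shift. The sketch below illustrates this with base R only. The tagger tokens and tags are written out by hand as an assumption, not produced by openNLP.

```r
msg <- "THank you, added it"

# Whitespace tokens, as produced by scan() in the verbs function (4 tokens).
ws_tokens <- scan(text = msg, what = "character", quiet = TRUE)

# Assumed tagger output: the comma is its own token (5 tokens).
tagger_tokens <- c("THank", "you", ",", "added", "it")
tags <- c("NN", "PRP", ",", "VBD", "PRP")  # hypothetical tags

verb_pos <- which(grepl("VB", tags))     # position 4 in the tagger's tokens

ws_verbs <- ws_tokens[verb_pos]          # 4th whitespace token
tagger_verbs <- tagger_tokens[verb_pos]  # 4th tagger token
```

With these assumed tags, `ws_verbs` is "it" and `tagger_verbs` is "added", which matches the row shown above. Indexing the tagger's own tokens avoids the shift.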

Rgds,

Robert
