NGramTokenizer not working as expected

au_anish
I have a simple R script that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart shows only single words rather than multi-word phrases. Where am I going wrong?

install.packages("xlsx")
install.packages("tm")
install.packages("RWeka")   # provides NGramTokenizer and Weka_control
install.packages("wordcloud")
install.packages("ggplot2")

library(xlsx)
library(tm)
library(RWeka)
library(wordcloud)
library(ggplot2)

setwd("C://Users//608447283//desktop//R_word_charts")


test <- Corpus(DirSource("C://Users//608447283//desktop//R_word_charts//source"))

test <- tm_map(test, stripWhitespace)
test <- tm_map(test, content_transformer(tolower))  # keep documents as tm text documents
test <- tm_map(test, removeWords,stopwords("english"))
test <- tm_map(test, removePunctuation)
test <- tm_map(test, PlainTextDocument)

tok <- function(x) NGramTokenizer(x, Weka_control(min=3, max=10))
tdm <- TermDocumentMatrix(test,control = list(tokenize = tok))
termFreq <- rowSums(as.matrix(tdm))

termFreq <- subset(termFreq, termFreq>=10)

write.csv(termFreq,file="TestCSV1")
TestCSV <- read.csv("C:/Users/608447283/Desktop/R_word_charts/TestCSV1")

ggplot(data=TestCSV, aes(x=X, y=x)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

My output: [bar chart image not preserved]

Source data: https://www.dropbox.com/s/4v29v5x868yqktw/sample%20data.txt?dl=0

Re: NGramTokenizer not working as expected

au_anish
I got it to work. The issue seems to be with the latest version of the tm package (0.7). The same code worked like a charm when I used version 0.6-2.
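
For later readers: a commonly reported explanation for this behaviour (an assumption here, not something confirmed in this thread) is that in tm 0.7 `Corpus()` may return a `SimpleCorpus`, for which a custom `tokenize` function is not applied, whereas `VCorpus()` still honours it. Below is a minimal sketch along those lines. `ngram_tokens` is a hypothetical base-R helper standing in for RWeka's `NGramTokenizer`, so the example runs without Java, and the two sample sentences are made up for illustration.

```r
# Hypothetical helper (not part of tm or RWeka): split text on whitespace and
# return every run of n consecutive words, for n from n_min to n_max.
ngram_tokens <- function(x, n_min = 3, n_max = 10) {
  words <- unlist(strsplit(paste(as.character(x), collapse = " "), "[[:space:]]+"))
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) break          # text too short for longer phrases
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  out
}

# Only run the tm part if the package is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Made-up sample documents; VCorpus() is used instead of Corpus() on the
  # assumption that it applies the custom tokenizer.
  docs <- VCorpus(VectorSource(c("the quick brown fox jumps",
                                 "the quick brown fox sleeps")))
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = ngram_tokens))
  print(rowSums(as.matrix(tdm)))          # phrase frequencies across documents
}
```

If the row names printed at the end are multi-word phrases, the tokenizer is being applied. The downgrade described above could presumably be done with something like `remotes::install_version("tm", version = "0.6-2")`, assuming the remotes package is available.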