Help with stemDocument

Deborah H. Deng
Hi, All:

I am new to R and the tm package. I'm trying to do stemming using tm_map()
and it doesn't seem to work:

*I used:*
> stemDocument(t_cmts[[100]])

*where t_cmts is the corpus object. The result is:*
 bottle loose box abt airpak sections top plastic bottle squashed nearly
flush neck previous shipments bottle wrapped securely bubble wrap wno
bottle damage packaging poor surprisingly bottle leaking remove contents
bottle reusable packaging cancel automatic shipments
>

Which doesn't seem to have any stemming done at all. *What did I do wrong*?

I have RWeka, tm, rJava, and Snowball installed (I used "install package" from
the top menu and it didn't say it failed).

Thanks,
Deborah

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Help with stemDocument

Alekseiy Beloshitskiy
Check this
slideshare.net/whitish/textmining-with-r

Best,
-Alex

Re: Help with stemDocument

Triss.Ashton
I am having a problem with stemDocument also.  I can make it work by moving the data into a Corpus by using:

>  a <- Corpus(VectorSource(df$text)) # create corpus object
>  a <- tm_map(a, stemDocument, language = "english")

but it is horribly slow.  I want to stem outside the Corpus object like:

> df$text <- stemDocument(df$text, language = "english")

but it returns the original text.

In fact, using the example in the tm package documentation does not work either:

> data("crude")
> crude[[1]]
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
> stemDocument(crude[[1]], language = "english") # specify language
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
> stemDocument(crude[[1]]) # language not specified
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
>
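
A minimal sketch of stemming a character vector word by word, outside a Corpus. It assumes the SnowballC package is installed, which nobody in this thread mentions, and `stem_text` is a hypothetical helper name, not a tm function. It is one possible approach, not the way tm does it internally.

```r
# Assumption: the SnowballC package (not mentioned in the thread) is installed.
library(SnowballC)

# Hypothetical helper: split each string on whitespace, stem every word,
# and paste the stems back together into one string per input element.
stem_text <- function(x, language = "english") {
  vapply(
    strsplit(x, "[[:space:]]+"),
    function(words) paste(wordStem(words, language = language), collapse = " "),
    character(1)
  )
}

stemmed <- stem_text(c("bottle shipments leaking", "running shipments"))
print(stemmed)
```

Punctuation stays attached to the words here, so this would probably work best on text that has already been cleaned, for example a column such as df$text after punctuation removal.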

Re: Help with stemDocument

Alekseiy Beloshitskiy
Hi Triss,

If you need to stem just one text in the Corpus, use a[[n]] <- stemDocument(a[[n]])

Best,
-Alex

Re: Help with stemDocument

Triss.Ashton
In reply to this post by Deborah H. Deng
Did you ever get this to work? I am also having a problem with stemDocument and removeWords. I think it is an issue with R 2.15 or the updated tm package, because I can get everything to run under R 2.10.

Re: Help with stemDocument

Triss.Ashton
In reply to this post by Alekseiy Beloshitskiy
Alekseiy, I tried your recommendation with several variations. It still does not run.  I think the problem has to do with R 2.15 and the updated tm package.  Everything runs under R 2.10 with the following code:

a <- Corpus(VectorSource(df$text)) # create corpus object
a <- tm_map(a, removePunctuation)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, stemDocument, language = "english")


The same code run on R 2.15 results in:
1. removeWords working sometimes, and sometimes not.
2. stemDocument not working at all.

Both error out.  removeWords always stops reading the stopword list at the same line number (I have added and removed words, with no difference). The errors and session info are below:

> a <- tm_map(a, removeWords, stopwords("english"))

Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "",  :
  invalid regular expression '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he


> a <- tm_map(a, stemDocument, language = "english")
Error in .jnew(name) : java.lang.ClassNotFoundException

SessionInfo:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets
[8] methods   base    

other attached packages:
 [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12    
 [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8  
 [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19  
[13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1  

loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17      
 [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
 [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0      
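
The removeWords error above is raised by a single gsub() call whose pattern joins every stopword with "|". One possible workaround, sketched here in base R only, is to remove the words in smaller batches so that each pattern stays short. The thread does not confirm that pattern length was the cause, and this is not tm's implementation; `remove_words_chunked` and `chunk_size` are hypothetical names, and `perl = TRUE` is a choice made for this sketch.

```r
# Hypothetical helper (not part of tm): remove whole words in batches so
# that each regular expression stays short.
remove_words_chunked <- function(text, words, chunk_size = 50L) {
  # Split the word list into batches of at most chunk_size words.
  batches <- split(words, ceiling(seq_along(words) / chunk_size))
  for (batch in batches) {
    # Same pattern shape as in the error message, built per batch.
    pattern <- sprintf("\\b(%s)\\b", paste(batch, collapse = "|"))
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  text
}

cleaned <- remove_words_chunked("the bottle was loose in the box",
                                c("the", "was", "in"), chunk_size = 2L)
# Collapse the gaps left behind, much as stripWhitespace would in a tm pipeline.
cleaned <- gsub("[[:space:]]+", " ", trimws(cleaned))
print(cleaned)
```

Like the original pattern, this does not escape regex metacharacters inside the words, so it assumes a plain stopword list.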

Re: Help with stemDocument

Milan Bouchet-Valat
On Thursday 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
> Alekseiy, I tried your recommendation with several variations. It still does
> not run.  I think the problem has to do with R2.15 and the refreshed TM
> package.
It works here with R 2.15.0 and tm 0.5-7.2 (development version), with all
other relevant packages at the same versions as yours (but on 64-bit
Linux). So the R and tm versions might not be the problem.

I'm using the docs example as a test:
data("crude")
crude[[1]]
stemDocument(crude[[1]])

> Everything runs under R2.10 with the following code:
>
> a <- Corpus(VectorSource(df$text)) # create corpus object
> a <- tm_map(a, removePunctuation)
> a <- tm_map(a, removeNumbers)
> a <- tm_map(a, removeWords, stopwords("english"))
> a <- tm_map(a, stripWhitespace)
> a <- tm_map(a, stemDocument, language = "english")
Let's focus on the example from the docs, since it's simple. Anyway, your
example is not reproducible since you do not provide the original data.

> > a <- tm_map(a, stemDocument, language = "english")
> Error in .jnew(name) : java.lang.ClassNotFoundException
This error suggests you should reconfigure Java. Have you tried
reinstalling rJava, Snowball, RWekajars and RWeka?


Re: Help with stemDocument

Triss.Ashton
Thanks Milan, it is running now.  It seems part of the problem, as you suggested, was the packages.  Although I had just installed RWeka, Snowball and the like, they were out of date, so updating fixed stemDocument. As for removeWords, that began working once I cut my data in half.  Apparently there are some memory management issues I have yet to figure out.  Thanks again for the help.

Triss
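
Since out-of-date packages turned out to be part of the cause, a quick look at what is installed can help before reinstalling. A minimal base-R sketch; `installed_versions` is a hypothetical helper, and the package names are simply the ones mentioned in this thread.

```r
# Hypothetical helper: report the installed version of each package,
# or NA when the package is not installed.
installed_versions <- function(pkgs) {
  ip <- utils::installed.packages()
  versions <- ip[match(pkgs, rownames(ip)), "Version"]
  names(versions) <- pkgs
  versions
}

print(installed_versions(c("tm", "rJava", "Snowball", "RWeka", "RWekajars")))

# To refresh outdated packages (needs network access), one option is:
# update.packages(ask = FALSE)
```

Comparing this output with the sessionInfo() from a machine where stemming works is one way to spot a stale package.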

