find similar words in text

find similar words in text

Riaan Van Der Walt
I am new to R and busy with text analysis.

I need a script to find, e.g.,

whale, whales, whale's, whaler, whalers, whaling, ... in Moby Dick.
 
Riaan
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: find similar words in text

Boris Steipe
You need a stemming algorithm. See here:
  https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Myself, I've had good experience with Rstem.

B.





> On Jul 31, 2017, at 4:47 PM, Riaan Van Der Walt <[hidden email]> wrote:
>
> I am new to R.
> Busy with Text Analysis.
>
> Need a script to find e.g
>
> whale, whales, whale's, whaler, whalers, whaling,... in Moby Dick
>
> Riaan

Re: find similar words in text

Bert Gunter-2
In reply to this post by Riaan Van Der Walt
**Before** posting:

1. Search: e.g. "text processing R"

2. Check CRAN Task views:
e.g. "Natural Language Processing"
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

3. Use R's search facility: e.g. help.search("character"), which would
lead you to ?grep among others, which might suggest something like

grep("whal", unlist(strsplit(yourtext, split = " ", fixed = TRUE)), fixed = TRUE)

... although this is likely too simple minded for a text as large as
Moby Dick. But it depends on what you want to do.
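
To make the suggestion above concrete, here is a minimal base-R sketch. The sample sentence, the variable name words, and the choice to lowercase the text and split on anything that is not a letter or an apostrophe are assumptions for illustration; they are not part of the original advice.

```r
# Hypothetical sample text standing in for the novel.
yourtext <- "The whale, the whalers and the Whale's jaw: whaling is hard. Whales!"

# Lowercase, then split on any run of characters that are not letters or apostrophes.
words <- unlist(strsplit(tolower(yourtext), "[^a-z']+"))
words <- words[nzchar(words)]   # drop any empty strings left by the split

# Positions of every token that contains the literal substring "whal".
hits <- grep("whal", words, fixed = TRUE)

# The matching tokens themselves.
words[hits]
```

Splitting on a single space, as in the one-liner, would leave punctuation attached to the words ("whale," and "Whales!"), which may or may not matter depending on what you want to count.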

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



find similar words in text

Riaan Van Der Walt
In reply to this post by Riaan Van Der Walt
I received this from Matt Jockers and it worked!
But I missed something: how can I now see (display) this list?

Hi Riaan,

There are a couple of ways that you could do this. . . the best
approach would probably be to use *grep* instead of *which*, but let me
show you both ways.

On page 30, replace
whales.v <- which(moby.word.v == "whale")
with
whale_words <- c("whale", "whales", "whale's", "whaler", "whalers",
"whaling")
whales.v <- which(moby.word.v %in% whale_words)

The alternative (better) way to do this, with grep, looks like this:

whales.v <- grep("^whal.*", moby.word.v)

grep uses the regular expression ^whal.* to find all words starting (^)
with "whal" followed by any number of other characters (.*).
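
The difference between the two approaches can be seen on a toy vector. This is only a sketch: the ten-word moby.word.v below is a hypothetical stand-in for the book's word vector, and the pattern is written without the trailing .* because grep only needs to find a match at the start of each word.

```r
# Hypothetical stand-in for the book's word vector (moby.word.v in the thread).
moby.word.v <- c("call", "me", "ishmael", "whale", "whales", "whaling",
                 "whaleman", "narwhal", "the", "whale")

# Approach 1: exact matching against a fixed list of word forms.
whale_words <- c("whale", "whales", "whale's", "whaler", "whalers", "whaling")
by_list <- which(moby.word.v %in% whale_words)

# Approach 2: pattern matching; "^" anchors the match to the start of the word.
by_regex <- grep("^whal", moby.word.v)

by_list    # positions 4, 5, 6, 10
by_regex   # positions 4, 5, 6, 7, 10: "whaleman" matches, "narwhal" does not
```

The fixed list only finds forms you thought of in advance, while the pattern also picks up forms such as "whaleman" that were not listed.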

All best,

Matt

--
Matthew L. Jockers
Associate Dean for Research and Partnerships
College of Arts & Sciences
Susan J. Rosowski Associate Professor of English
University of Nebraska-Lincoln
1223 Oldfather Hall
P.O. Box 880312
Lincoln, NE  68588-0312
402.472.2891
www.matthewjockers.net
 

Re: find similar words in text

Bert Gunter-2
You really need to spend some time learning R if you wish to use R.

See ?grep and note the "value" argument. So you want:

whales.v <- grep("^whal.*", moby.word.v, value = TRUE)
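
As a small illustration of what the value argument changes, assuming a hypothetical six-word vector in place of the real moby.word.v:

```r
# Hypothetical stand-in for the book's word vector.
moby.word.v <- c("the", "whale", "and", "the", "whalers", "whale")

# Default: grep returns the integer positions of the matches.
idx <- grep("^whal", moby.word.v)

# With value = TRUE: grep returns the matching words themselves.
vals <- grep("^whal", moby.word.v, value = TRUE)

# Indexing the vector by position gives the same words.
identical(moby.word.v[idx], vals)

# Distinct word forms found, and the total number of hits.
unique(vals)
length(vals)
```

Without value = TRUE, printing the result shows a long vector of numbers (positions in the book) rather than the words, which is why the list seemed impossible to display.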

-- Bert




On Thu, Aug 3, 2017 at 5:14 AM, Riaan Van Der Walt
<[hidden email]> wrote:

> I received this from Matt Jockers and it worked!
> I missed something.
> How can I now see(display) this list?

Re: find similar words in text

Boris Steipe
In reply to this post by Boris Steipe
Please keep messages on the list so others can pitch in.

_Which_ words do you want to consider identical for the purpose of frequency count?
_What_ do you want to plot?
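
One possible answer to these two questions, sketched in base R: it assumes that every distinct spelling is counted separately (no stemming) and that a bar chart of the counts is what is wanted. The short moby.word.v and the names whale.hits.v and whale.freqs.t are hypothetical, chosen only for this illustration.

```r
# Hypothetical stand-in for the book's word vector.
moby.word.v <- c("whale", "the", "whales", "whale", "whaling", "whale",
                 "whaler", "sea", "whales", "whale's")

# Keep the matching words themselves, not their positions.
whale.hits.v <- grep("^whal", moby.word.v, value = TRUE)

# Count each distinct form and sort from most to least frequent.
whale.freqs.t <- sort(table(whale.hits.v), decreasing = TRUE)

# Display the frequency table.
print(whale.freqs.t)

# Plot one bar per word form; las = 2 turns the labels so they stay readable.
barplot(whale.freqs.t, las = 2, ylab = "Occurrences",
        main = "Words starting with 'whal'")
```

If forms such as "whale" and "whales" should instead be counted as the same word, they would have to be mapped to a common form (for example with a stemmer) before calling table.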



B.



> On Aug 3, 2017, at 4:36 PM, Riaan Van Der Walt <[hidden email]> wrote:
>
> Hallo Boris,
> I've loaded Rstem and Snowball.
> But I am clueless how to get a list of e.g. whal* (whale, whales, whaling, whaler, whalers, whaleman, whalemen, whale-ship, whale-boat, whale's)
> in the book Moby Dick and the frequency of each of the different words.
> I am using this script:
>  
> whales1.v <- grep("^whal.*", moby.word.v)
> whales1.v
>  
> The total occurrence for whal* is 1699.
> But I can't display it or plot it.
>  
> I am new to R and the learning curve is steep!!
>  
> Thx!
> Riaan
>
>
> Riaan van der Walt
> Tel / Phone / Mogala : 27+72+2172429
> Email / Epos / Emeile: [hidden email]
> Url: http://www.nwu.ac.za/
>  
> >>> Boris Steipe <[hidden email]> 31 Jul 2017 23:37 >>>
> You need a stemming algorithm. See here:
>   https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
>
> Myself, I've had good experience with Rstem.
>
> B.

Re: find similar words in text

Patrick Casimir
Use the tm package: create a corpus, build a term-document matrix (TDM) from it, and then apply as.matrix() to the TDM to display each term's occurrences. Go to CRAN and read about the tm package.
________________________________
From: R-help <[hidden email]> on behalf of Boris Steipe <[hidden email]>
Sent: Thursday, August 3, 2017 6:40:09 PM
To: Riaan Van Der Walt
Cc: R lists
Subject: Re: [R] find similar words in text

Please keep messages on the list so others can pitch in.

_Which_ words do you want to consider identical for the purpose of frequency count?
_What_ do you want to plot?



B.


