word stemming for corpus linguistics


Andy Wolfe
Hi list

For a piece of work I'm doing in corpus linguistics, I'm using a combo
of two texts, Gries "Quantitative Corpus Linguistics with R: A Practical
Introduction" and Jockers "Text Analysis with R for Students of
Literature", which are both really excellent by the way. I want to stem
or lemmatize the words so that, e.g., 'facilitating', 'facilitated',
and 'facilitates' all become 'facilit'.

In text mining, this is feasible using a combination of the packages
'tm' and 'SnowballC', but I am finding that working with the DTM
(document term matrix) becomes difficult when I want to do concordance
(or key word in context) analysis.

So, two questions:

(1) is there a package for R version 3.3.1 that can work with corpus
linguistics? and/or

(2) is there a way of doing concordance analysis using the tm package as
part of the whole text mining process?

I appreciate any help. Thanks.

Andy



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: word stemming for corpus linguistics

Paul Johnston-2
I suggest looking at http://www.inside-r.org/packages/cran/tm/docs/stemDocument




Re: word stemming for corpus linguistics

Andy Wolfe
Hi Paul

I have seen this - it's part of the tm package mentioned originally. So,
I've tried it again and perhaps I'm using stemDocument incorrectly, but
this is what I am doing:

> library(tm)
Loading required package: NLP
> text.v <- scan(file.choose(), what = 'char', sep = '\n')
Read 938 items
> text.stem.v <- stemDocument(text.v, language = 'english')

But it isn't changing anything in the body of the text I'm passing to
it - the words remain unlemmatized/unstemmed.

When I try using SnowballC, the error returned is that tm_map doesn't
have a method to work with objects of class 'character'.

Again, the problem is that tm doesn't seem to allow for concordance
analysis ... or perhaps it does and I just haven't figured out how to do
it. I'd be happy to be shown some documentation on that process, and on
whether it is applied before or after the text is transformed into a
DTM, because searching online hasn't (yet) turned anything up.

Thanks.
Andy
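
A note on the stemDocument() call above: a Snowball-style stemmer strips suffixes from single words, while scan(..., sep = '\n') yields one string per line, so passing whole lines may leave most of the text untouched (the exact behaviour depends on the tm version). Likewise, tm_map() operates on corpus objects rather than bare character vectors; in tm a vector can be wrapped with VCorpus(VectorSource(text.v)) first. What follows is a minimal sketch, not tm's own implementation, of stemming a character vector line by line. The names stem_lines and stem_fun are hypothetical, and the default stemmer assumes the SnowballC package is installed.

```r
# Hypothetical helper: stem every word of every line, returning one string per line.
# stem_fun is any function that maps a character vector of single words to stems.
# The default is only evaluated when used, and assumes SnowballC is installed.
stem_lines <- function(lines, stem_fun = SnowballC::wordStem) {
  vapply(lines, function(line) {
    # Split the line into whitespace-separated tokens and drop empty ones
    words <- strsplit(line, "[[:space:]]+")[[1]]
    words <- words[nzchar(words)]
    # Stem word by word, then glue the line back together
    paste(stem_fun(words), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Usage with the words from the original question (runs only if SnowballC is available)
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(stem_lines("facilitating facilitated facilitates"))
}
```

Splitting on whitespace leaves punctuation attached to the neighbouring word, so such tokens may pass through unstemmed; stripping punctuation first is a separate design choice.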



Re: word stemming for corpus linguistics

Paul Johnston-2

Hi

I use tm_map() with stemDocument passed as an argument.

Looking at a particular file before stemming:

writeLines(as.character(data_mined_volatile[[1]]))

## The European Union is a "force for social injustice" which backs "the haves rather than the have-nots", Iain Duncan Smith has said.
## The ex-work and pensions secretary said "uncontrolled migration" drove down wages and increased the cost of living.
## He appealed to people "who may have done OK from the EU" to "think about the people that haven't".
## But Labour's Alan Johnson said the EU protected workers and stopped them from being "exploited".
## The former Labour home secretary accused the Leave campaign of dismissing such protections as "red tape".
## In other EU referendum campaign developments:
## Thirteen former US secretaries of state and defence and national security advisers, including Madeleine Albright and Leon Panetta, say in a letter to the Times that the UK's "place and influence" in the world would be diminished if it left the EU - and Europe would be "dangerously weakened"
## A British Chambers of Commerce survey suggests most business people back Remain but the gap with those backing Leave has narrowed.
## Five former heads of Nato claimed the UK would lose influence and "give succour to its enemies" by leaving the EU - claims dismissed as scaremongering by Boris Johnson
## Mr Corbyn is launching his party's battle bus, saying Labour votes will be crucial if the Remain side is to win
## The official Scottish campaign to keep the UK in the European Union is due to be launched in Edinburgh
## Mr Duncan Smith's speech came after he told the Sun Germany had a "de facto veto" over David Cameron's EU renegotiations, with Angela Merkel blocking the PM's plans for an "emergency brake" on EU migration.
## Downing Street said curbs it negotiated on in-work benefits for EU migrants were a "more effective" way forward.
## Follow the latest developments on BBC EU referendum live
## Laura Kuenssberg: Can Leave win over the have-nots


Now look at the same text after stemming:

corpus <- data_mined_volatile
corpus <- tm_map(corpus, stemDocument)

writeLines(as.character(corpus[[1]]))

## The European Union is a "forc for social injustice" which back "the have rather than the have-nots", Iain Duncan Smith has said.
## The ex-work and pension secretari said "uncontrol migration" drove down wage and increas the cost of living.
## He appeal to peopl "who may have done OK from the EU" to "think about the peopl that haven't".
## But Labour Alan Johnson said the EU protect worker and stop them from be "exploited".
## The former Labour home secretari accus the Leav campaign of dismiss such protect as "red tape".
## In other EU referendum campaign developments:
## Thirteen former US secretari of state and defenc and nation secur advisers, includ Madelein Albright and Leon Panetta, say in a letter to the Time that the UK "place and influence" in the world would be diminish if it left the EU - and Europ would be "danger weakened"
## A British Chamber of Commerc survey suggest most busi peopl back Remain but the gap with those back Leav has narrowed.
## Five former head of Nato claim the UK would lose influenc and "give succour to it enemies" by leav the EU - claim dismiss as scaremong by Bori Johnson
## Mr Corbyn is launch his parti battl bus, say Labour vote will be crucial if the Remain side is to win
## The offici Scottish campaign to keep the UK in the European Union is due to be launch in Edinburgh
## Mr Duncan Smith speech came after he told the Sun Germani had a "de facto veto" over David Cameron EU renegotiations, with Angela Merkel block the PM plan for an "emerg brake" on EU migration.
## Down Street said curb it negoti on in-work benefit for EU migrant were a "more effective" way forward.
## Follow the latest develop on BBC EU referendum live
## Laura Kuenssberg: Can Leav win over the have-not
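
The stemmed text above still has its word order, which a document-term matrix does not keep, so a key-word-in-context listing can be built from it directly. Below is a minimal base R sketch, assuming the text has been pulled out of the corpus as character strings (as as.character(corpus[[1]]) does above). The function kwic and its arguments are hypothetical names, not part of tm.

```r
# Hypothetical helper: key word in context over a vector of tokens.
# Returns one row per hit with up to `window` tokens on either side.
kwic <- function(tokens, keyword, window = 3) {
  hits <- which(tolower(tokens) == tolower(keyword))
  if (length(hits) == 0) {
    return(data.frame(position = integer(0), left = character(0),
                      keyword = character(0), right = character(0),
                      stringsAsFactors = FALSE))
  }
  rows <- lapply(hits, function(i) {
    left  <- tail(tokens[seq_len(i - 1)], window)                    # tokens before the hit
    right <- head(tokens[seq_len(length(tokens) - i) + i], window)   # tokens after the hit
    data.frame(position = i,
               left     = paste(left, collapse = " "),
               keyword  = tokens[i],
               right    = paste(right, collapse = " "),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage on one stemmed line copied from the output above
line   <- "Mr Corbyn is launch his parti battl bus, say Labour vote will be crucial if the Remain side is to win"
tokens <- strsplit(line, "[[:space:]]+")[[1]]
print(kwic(tokens, "launch", window = 2))
```

Matching is exact after lower-casing, so a token with trailing punctuation (such as "bus,") would only match a keyword written the same way.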


Re: word stemming for corpus linguistics

Andy Wolfe
Hi

Thanks for following up on this thread.

I've opted for this, albeit circuitous, route: use the tm package to
stem the document and then use writeCorpus to write the stemmed document
to disk, so that I can open it up and do the concordancing piece.

Many thanks - this'll do me fine until I come across a better (read,
more elegant) solution.
Best
Andy
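
Writing the stemmed corpus to disk is one way to proceed. An alternative that keeps everything in memory is to hold the original and stemmed tokens side by side, match on the stem, and print the original words as context. A minimal base R sketch under that assumption follows. The name stem_concordance and its arguments are hypothetical, and the two short vectors are copied by hand from the before and after stemming output earlier in the thread.

```r
# Hypothetical helper: find hits by stem, but show the original words as context.
# `original` and `stems` must be parallel: stems[i] is the stem of original[i].
stem_concordance <- function(original, stems, target, window = 3) {
  stopifnot(length(original) == length(stems))
  n    <- length(original)
  hits <- which(stems == target)
  vapply(hits, function(i) {
    lo <- max(1, i - window)
    hi <- min(n, i + window)
    context <- original[lo:hi]
    # Mark the matched word with square brackets
    context[i - lo + 1] <- paste0("[", original[i], "]")
    paste(context, collapse = " ")
  }, character(1))
}

# Usage: the start of one sentence, before and after stemming
original <- c("He", "appealed", "to", "people", "who", "may", "have", "done", "OK")
stems    <- c("He", "appeal",   "to", "peopl",  "who", "may", "have", "done", "OK")
print(stem_concordance(original, stems, "peopl", window = 2))
# prints "appealed to [people] who may"
```

Searching on stems groups inflected forms together, while the displayed context stays readable because it is drawn from the unstemmed tokens.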

