How to reduce the sparseness in a TDM to make a cluster plot readable?

Andy Wolfe
Hello all

I am doing some text mining on a set of five plain text files and have
run into a snag when I run hclust in that there are just too many leaves
for anything to be read. It returns a solid black line.

The texts have been converted into a TDM with dimensions 5,292 x 5
(5,292 terms across the 5 docs).

My code for removing sparsity is as follows:

 > tdm2 <- removeSparseTerms(tdm, sparse=0.99999)

 > inspect(tdm2)

<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 10415/16045
Sparsity           : 61%
Maximal term length: 22
Weighting          : term frequency (tf)

While the tf-idf weighting returns this when 0.99999 sparseness is removed:

 > inspect(tdm.tfidf)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 7915/18545
Sparsity           : 70%
Maximal term length: 22
Weighting          : term frequency - inverse document frequency
(normalized) (tf-idf)

I have experimented with lowering the value I pass as the sparse
threshold, and that helps a bit, for example:

 > tdm2 <- removeSparseTerms(tdm, sparse=0.215)
 > inspect(tdm2)
<<TermDocumentMatrix (terms: 869, documents: 5)>>
Non-/sparse entries: 3976/369
Sparsity           : 8%
Maximal term length: 14
Weighting          : term frequency (tf)

But, no matter what I do, the resulting plot is unreadable. The code for
plotting the cluster is:

 > hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
 > plot(hc, yaxt = 'n', main = "Hierarchical clustering")

Can someone kindly advise me on what I am doing wrong and/or point me
to some detailed info on how to fix this?

Many thanks in anticipation.

Andy
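
One likely reason the threshold experiments only help "a bit": with five documents, a term can occur in only 1, 2, 3, 4 or 5 of them, so removeSparseTerms() has at most five distinct outcomes. The sketch below uses base R only and assumes the keep rule is "the term occurs in more than nDocs * (1 - sparse) documents", which is my reading of tm and worth checking against ?removeSparseTerms. The matrix m is simulated and merely stands in for as.matrix(tdm).

```r
set.seed(1)
n_docs <- 5

# Hypothetical stand-in for as.matrix(tdm): terms in rows, documents in columns
m <- matrix(rpois(200 * n_docs, 0.7), ncol = n_docs)
m <- m[rowSums(m) > 0, , drop = FALSE]  # a TDM holds no term with zero occurrences

# Number of documents in which each term occurs (an integer from 1 to 5)
doc_count <- rowSums(m > 0)

# ASSUMED keep rule, emulating removeSparseTerms(tdm, sparse)
terms_kept <- function(sparse) sum(doc_count > n_docs * (1 - sparse))

thresholds <- c(0.1, 0.3, 0.5, 0.7, 0.9, 0.99999)
kept <- sapply(thresholds, terms_kept)
print(data.frame(sparse = thresholds, terms_kept = kept))
```

On that reading, sparse = 0.215 keeps the terms found in at least four of the five documents, which is consistent with the 869 terms and only 369 empty cells reported above. It also means no threshold can drop the terms shared by all five documents, so the dendrogram may stay crowded however low the value goes.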



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: How to reduce the sparseness in a TDM to make a cluster plot readable?

Abby Spurdle
I'm not familiar with these subjects,
and hopefully someone who is will offer some better suggestions.

But to get things started, maybe...
(1) What packages are you using (re: tdm)?
(2) Where does the problem happen, in dist, hclust, the plot method
for hclust, or in the package(s) you are using?
(3) Do you think you could produce a small reproducible example,
showing what is wrong, and explaining what you would like it to do instead?

Note that if the problem relates to hclust, or the plot method, then
you should be able to produce a much simpler example.
e.g.

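    # 5,000 x 5 matrix of Poisson counts, standing in for a large TDM
    mycount.matrix <- matrix (rpois (25000, 20), ncol = 5)
    head (mycount.matrix, 3)
    tail (mycount.matrix, 3)

    # dist() works on rows, so this dendrogram has 5,000 leaves
    plot (hclust (dist (mycount.matrix) ) )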
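
Building on the example above, a small sketch of what determines the number of leaves: dist() computes distances between rows, so hclust() produces one leaf per row. For a term-document matrix that means one leaf per term, while transposing it clusters the documents instead. The matrix here is simulated, and smaller than the one above only to keep it quick.

```r
set.seed(42)

# 500 x 5 matrix of counts: rows play the role of terms, columns of documents
m <- matrix(rpois(2500, 20), ncol = 5)

hc_rows <- hclust(dist(m))     # one leaf per row (term)
hc_cols <- hclust(dist(t(m)))  # one leaf per column (document)

length(hc_rows$order)  # 500 leaves
length(hc_cols$order)  # 5 leaves

# Only draw when run interactively
if (interactive()) plot(hc_cols, main = "Clustering the 5 columns")
```

If the aim is to compare the five texts, the transposed version is the readable one. If the aim is to cluster terms, the number of rows has to come down first.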

On Tue, Sep 15, 2020 at 6:54 AM Andrew <[hidden email]> wrote:

>
> Hello all
>
> I am doing some text mining on a set of five plain text files and have
> run into a snag when I run hclust in that there are just too many leaves
> for anything to be read. It returns a solid black line.

Re: How to reduce the sparseness in a TDM to make a cluster plot readable?

Andy Wolfe
Hi Abby

Many thanks for reaching out with an offer of help. Very much appreciated.

(1) The packages I'm using are 'tm' for the text mining and the TDM, and
'cluster' for the clustering.
(2) I'm not sure where the problem is happening, as it doesn't show up as
an error. It manifests in the plotting; however, logic would suggest
that it concerns the removal of sparse terms, so that would be in the
TDM process.
(3) I don't think I can provide a reproducible example. When I practice
using data sets that packages provide, all is fine. The trouble is when
I apply it to my own data sets which are five documents, etc., as described.

I think the nub of it is really to find a way that I can subset the TDM
to return the twenty or thirty most frequently used words, and then to
plot those using hclust. However, when searching on-line I haven't been
able to find any suggestions on how to do that, nor is there any mention
of using that approach in the books and tutorials I have.

If you (or someone on this list) can advise on how I can sort the terms
in the TDM from most to least frequent, and then subset the twenty or
thirty most frequently occurring terms (preferably using tf as well as
tf-idf), I can plot that subset. I think that would do the trick, and
the terms would be plotted clearly and legibly.

Thanks again for your offer of help. I hope that my reply helps clarify
rather than muddy the situation.

Best wishes
Andy
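
A minimal sketch of the sort-and-subset idea described above, in base R. The matrix m is simulated and stands in for as.matrix(tdm); the term names, document names and the cut-off of 25 are made up for illustration. With a real tm object the same steps should apply after m <- as.matrix(tdm), though I have not run it against the original data.

```r
set.seed(7)
n_terms <- 300

# Hypothetical stand-in for as.matrix(tdm): each term gets its own average rate
m <- matrix(rpois(n_terms * 5, lambda = rep(rexp(n_terms, 1 / 3), 5)),
            ncol = 5,
            dimnames = list(paste0("term", seq_len(n_terms)),
                            paste0("doc", 1:5)))

# Total frequency of each term across all documents, most frequent first
freq <- sort(rowSums(m), decreasing = TRUE)

# Keep the 25 most frequent terms (any ties at the cut-off are broken by position)
top <- names(head(freq, 25))
m_top <- m[top, , drop = FALSE]

# Same clustering call as in the original post, now on 25 rows
hc <- hclust(dist(m_top, method = "euclidean"), method = "complete")
if (interactive()) plot(hc, main = "Hierarchical clustering, top 25 terms")
```

With 25 leaves the term labels should have room to print. Raw counts are dominated by the most frequent words, so scaling the rows before dist() may be worth trying as well.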


On 17/09/2020 08:43, Abby Spurdle wrote:

> I'm not familiar with these subjects.
> And hopefully, someone who is, will offer some better suggestions.
>
> But to get things started, maybe...
> (1) What packages are you using (re: tdm)?
> (2) Where does the problem happen, in dist, hclust, the plot method
> for hclust, or in the package(s) you are using?
> (3) Do you think you could produce a small reproducible example,
> showing what is wrong, and explaining you would like it to do instead?

Re: How to reduce the sparseness in a TDM to make a cluster plot readable?

Jim Lemon-4
Hi Andrew,
From your last email the answer to your problem may be the
findFreqTerms() function. Just increase the number of times a term has
to appear and check the result until you get the matrix size that you
want.

Jim
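
The "increase and check" loop can be sketched as follows. This is a base R emulation on a simulated matrix, not tm itself: rowSums(m) >= lowfreq is assumed to play the role of findFreqTerms(tdm, lowfreq = n), and with a real TDM the resulting terms would then be used to subset it, e.g. tdm[terms, ]. The target of 30 terms is arbitrary.

```r
set.seed(3)
n_terms <- 400

# Hypothetical stand-in for as.matrix(tdm)
m <- matrix(rpois(n_terms * 5, lambda = rep(rexp(n_terms, 1 / 2), 5)),
            ncol = 5,
            dimnames = list(paste0("term", seq_len(n_terms)),
                            paste0("doc", 1:5)))
total <- rowSums(m)

# Raise the minimum frequency until no more than `target` terms remain
target <- 30
lowfreq <- 1
while (sum(total >= lowfreq) > target) {
  lowfreq <- lowfreq + 1
}

keep <- names(total)[total >= lowfreq]
m_small <- m[keep, , drop = FALSE]
cat("lowfreq =", lowfreq, "keeps", nrow(m_small), "terms\n")
```

Because terms can tie on frequency, the loop may land on fewer terms than the target rather than exactly on it.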

On Fri, Sep 18, 2020 at 5:32 PM Andrew <[hidden email]> wrote:

> I think the nub of it is really to find a way that I can subset the TDM
> to return the twenty or thirty most frequently used words, and then to
> plot those using hclust. However, when searching on-line I haven't been
> able to find any suggestions on how to do that, nor is there any mention
> of using that approach in the books and tutorials I have.

Re: How to reduce the sparseness in a TDM to make a cluster plot readable?

Andy Wolfe
Hello Jim

Thanks for that. I'll read up on it and will give it a go, either later
today or tomorrow. I am assuming this will work for both tf and tf-idf
weighted TDMs?

Much appreciated. :-)

Best wishes
Andy
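
On the tf versus tf-idf question, a hedged sketch of what changes. It assumes the normalized tf-idf weighting is "count divided by document length, times log2(nDocs / number of documents containing the term)", which is my understanding of tm's weightTfIdf() and should be checked against its help page. The data are simulated. On that assumption a frequency cut-off would apply to summed weights rather than counts, and any term present in all five documents gets a weight of exactly zero.

```r
set.seed(11)
n_terms <- 200

# Hypothetical stand-in for as.matrix(tdm) with raw term frequencies
m <- matrix(rpois(n_terms * 5, lambda = rep(rexp(n_terms, 1), 5)),
            ncol = 5,
            dimnames = list(paste0("term", seq_len(n_terms)),
                            paste0("doc", 1:5)))
m <- m[rowSums(m) > 0, , drop = FALSE]  # drop terms that never occur

# ASSUMED tf-idf definition (normalized tf times log2 idf)
tf <- sweep(m, 2, colSums(m), "/")     # share of each document taken by the term
idf <- log2(ncol(m) / rowSums(m > 0))  # zero for terms found in every document
tfidf <- tf * idf                      # idf is recycled down the rows

# The 20 top-ranked terms under each weighting
top_tf <- names(head(sort(rowSums(m), decreasing = TRUE), 20))
top_tfidf <- names(head(sort(rowSums(tfidf), decreasing = TRUE), 20))

in_all_docs <- rowSums(m > 0) == ncol(m)
cat("terms in both top-20 lists:", length(intersect(top_tf, top_tfidf)), "\n")
cat("terms with zero tf-idf weight:", sum(in_all_docs), "\n")
```

So the same approach should run on either matrix, but the two rankings can differ a good deal, and the cut-off values are not comparable between them.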


On 18/09/2020 09:18, Jim Lemon wrote:

> Hi Andrew,
> From your last email the answer to your problem may be the
> findFreqTerms() function. Just increase the number of times a term has
> to appear and check the result until you get the matrix size that you
> want.
>
> Jim