Manually modifying an hclust dendrogram to remove singletons

r-help.20.trevva
Dear R-Help,

I have a clustering problem with hclust that I hope someone can help
me with. Consider the classic hclust example:

     hc <- hclust(dist(USArrests), "ave")
     plot(hc)

I would like to cut the tree in a way that avoids small clusters, so
that each cluster contains a minimum number of items and there are no
singletons. For example, you can see that Hawaii is split off on its
own at quite a high level. I would like to avoid having a single item
clustered on its own like this. How can I achieve this?
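
To make the problem concrete, here is a minimal sketch in base R that counts the singleton clusters for several cuts of the tree. The range of k values and the variable names are arbitrary choices for illustration only.

```r
hc <- hclust(dist(USArrests), "ave")

# Cluster sizes for cuts into k = 2, ..., 8 groups (range chosen arbitrarily)
ks <- 2:8
sizes_by_k <- lapply(ks, function(k) table(cutree(hc, k = k)))
names(sizes_by_k) <- paste0("k=", ks)

# Number of singleton clusters at each k
n_singletons <- sapply(sizes_by_k, function(s) sum(s == 1))
n_singletons
```

Any k with a non-zero count has at least one item sitting in a cluster of its own.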

I have tried manually modifying the tree using dendrapply but have not
been able to produce a valid solution so far.

Suggestions are welcome.

Best wishes,

Mark

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Manually modifying an hclust dendrogram to remove singletons

ilai-2
Can't put my finger on it but something about your idea rubs me the
wrong way. Maybe it's that the tree depends on the hierarchical
clustering algorithm, and the choice of how to trim it should be based
on something more defensible than "avoid singletons". In this example
Hawaii is really different from New Hampshire; why would you want them
clustered together?

But, it's your work, field of study, whatever. If you are going to do
it anyway, one way would be to loop over cut heights:

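 hc <- hclust(dist(USArrests), "ave")
 plot(hc)                     # rect.hclust() below draws on this plot
 hr <- range(hc$height)
 tol <- diff(hr)/100          # step between candidate cut heights
 # raise the cut height until no cluster is a singleton
 for (i in seq(1e-4 + hr[1], hr[2], tol)) {
   hcc <- rect.hclust(hc, h = i)
   if (all(sapply(hcc, length) > 1)) break
 }
 str(hcc)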

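# or, if you prefer a dendrogram (reuses hc, hr and tol from above)
 dend1 <- as.dendrogram(hc)
 for (i in seq(1e-4 + hr[1], hr[2], tol)) {
   dend2 <- cut(dend1, h = i)
   # stop once every lower branch has more than one member
   if (all(sapply(dend2$lower, function(x) attr(x, "members")) > 1)) break
 }
 dend2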
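
The same search can also be written without drawing anything, as a rough sketch: step through the merge heights stored in the tree and use cutree() to get the cluster labels directly. The helper name min_size_cut and its min_size argument are made up for this illustration and are not part of any package.

```r
hc <- hclust(dist(USArrests), "ave")

# Hypothetical helper: lowest merge height at which every cluster
# has at least min_size members.
min_size_cut <- function(hc, min_size = 2) {
  for (h in sort(hc$height)) {
    cl <- cutree(hc, h = h)
    if (min(table(cl)) >= min_size) {
      return(list(height = h, clusters = cl))
    }
  }
  NULL  # only reached if min_size exceeds the number of items
}

res <- min_size_cut(hc, min_size = 2)
res$height
table(res$clusters)
```

Using the merge heights themselves instead of an evenly spaced grid means no candidate cut is skipped, at the cost of one cutree() call per merge.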

Cheers



Re: Manually modifying an hclust dendrogram to remove singletons

plangfelder
In reply to this post by r-help.20.trevva
On Thu, May 24, 2012 at 9:31 AM,  <[hidden email]> wrote:

> Dear R-Help,
>
> I have a clustering problem with hclust that I hope someone can help
> me with. Consider the classic hclust example:
>
>     hc <- hclust(dist(USArrests), "ave")
>     plot(hc)
>
> I would like to cut the tree up in such a way so as to avoid small
> clusters, so that we get a minimum number of items in each cluster,
> and therefore avoid singletons. e.g. in this example, you can see that
> Hawaii is split off onto its own at quite a high level. I would like
> to avoid having a single item clustered on its own like this. How can
> I achieve this?
>
> I have tried manually modifying the tree using dendrapply but have not
> been able to produce a valid solution thus far..
>
> Suggestions are welcome.
>
> Best wishes,
>
> Mark

Hi Mark,

I'm not sure how you want to handle the singletons if you don't want
them in a separate cluster. The package WGCNA (I'm the maintainer) and
its dependency dynamicTreeCut contain a few ways of avoiding
singletons as separate clusters.

One way is to remove them from the resulting clusters. To this end,
use the function cutreeStatic and specify the cut height and the
minimum number of elements in a cluster. For example,

library(WGCNA)   # loads its dependency dynamicTreeCut as well
clusters1 = cutreeStatic(hc, cutHeight = 35, minSize = 3);

This way all branches that have size below 3 are labeled 0.
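
The relabelling idea can be imitated in base R, as a rough sketch and not as a description of what cutreeStatic does internally: cut at a fixed height with cutree(), then set the label of every cluster below the minimum size to 0. The variable names are illustrative, and the numbering of the remaining clusters need not match the output of cutreeStatic.

```r
hc <- hclust(dist(USArrests), "ave")

min_size <- 3
cl <- cutree(hc, h = 35)            # plain fixed-height cut
sizes <- table(cl)

# Labels of clusters that are too small
small <- as.integer(names(sizes)[sizes < min_size])

# Copy the labels and mark members of small clusters with 0
labels0 <- cl
labels0[cl %in% small] <- 0L
table(labels0)
```

The items labelled 0 are the ones a plain cutree() call would have left as singletons or very small groups.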

To see what you get, use the function plotDendroAndColors like this:

plotDendroAndColors(hc, clusters1, rowText = clusters1 );

Each color corresponds to a cluster, and the cluster label is shown by
the numbers (each number is at the start of the corresponding
cluster).

If you'd like to assign everything but want to avoid clusters that are
too small, use the dynamic tree cut approach
(http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/BranchCutting/).
For example:

clusters2 = cutreeDynamic(hc, distM = as.matrix(dist(USArrests)),
minClusterSize = 3, deepSplit = 2)

To show the clusters:
plotDendroAndColors(hc, clusters2, rowText = clusters2 );

If you think the clusters are too big, try setting deepSplit=3 in the
cutreeDynamic call.

The dynamic tree cut basically assigns all singletons and branches
with size less than minClusterSize to the nearest existing cluster
(notice Hawaii and the Florida/North Carolina branch), thus combining
hierarchical clustering with a PAM-like step. Whether that's a good
approach for your research goal is a question you need to answer.
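
To illustrate only the "assign to the nearest existing cluster" idea, here is a simplified base-R sketch. It is not the dynamic tree cut algorithm: it starts from a fixed-height cut (the height 35 and the minimum size 3 are taken from the examples above) and takes "nearest" to mean the smallest average distance, which is an assumption made for this example.

```r
hc <- hclust(dist(USArrests), "ave")
d  <- as.matrix(dist(USArrests))

min_size <- 3
cl <- cutree(hc, h = 35)
sizes <- table(cl)

big   <- as.integer(names(sizes)[sizes >= min_size])  # clusters to keep
stray <- which(!(cl %in% big))                        # items in small clusters

merged <- cl
for (i in stray) {
  # average distance from item i to the members of each kept cluster
  avg <- sapply(big, function(k) mean(d[i, cl == k]))
  merged[i] <- big[which.min(avg)]
}
table(merged)
```

After the loop every item belongs to one of the clusters that already had at least min_size members.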

HTH,

Peter


Re: Manually modifying an hclust dendrogram to remove singletons

Mark Payne-3
Hi,

Thanks for the replies - they have helped shape my thinking and are
starting to push me in a better direction. Maybe I should explain a
little more about what I'm trying to achieve.

I am analysing satellite data across the global ocean, and am
interested in trying to classify areas of the ocean according to the
similarity between the pixels. Singletons in this case therefore
represent individual pixels that are different to the rest in terms of
the similarity metric, but aren't really all that interesting in terms
of the broad picture - I consider them "outliers" or "noise". However,
they are annoying when it comes to splitting up the dendrogram,
because I'm mainly interested in the reclassification of large areas
of ocean at each step, rather than changes in the similarity.

The dynamic tree-cut approach looks like a promising and sensible
solution to the problem - I'll see if I can get something out of it.

However, this discussion has started me wondering how I can use the
spatial proximity of the pixels in the analysis - does anyone have any
insights? Can the WGCNA approach be used in such a context?

Best wishes,

Mark Payne

