Advice on exploration of sub-clusters in hierarchical dendrogram

Advice on exploration of sub-clusters in hierarchical dendrogram

kosmo7
Dear R user,

I am a biochemist/bioinformatician, at the moment working on protein clusterings by conformation similarity.

I only started working seriously with R a couple of months ago.
So far I have been able to read my way through tutorials and set up my hierarchical clusterings. My problem is that I cannot find a way to obtain information on the branching below specific nodes, i.e. below specific clusters of interest.
In other words, I am trying to obtain/read the sub-clusters of a specific cluster in the dendrogram, by isolating a specific node and exploring locally its lower hierarchy.

Please allow me to display some of the code I have been using for your reference:

library(cluster) #provides clusplot()
df=read.table('mydata.txt', header=TRUE, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map
var1<-var(clusters==1) #variance of the TRUE/FALSE membership indicator for cluster 1 (not of the distances within the cluster)

#extract cluster memberships:
clids = as.data.frame(clusters)
names(clids) = c("id")
clids$cdr = row.names(clids)
row.names(clids) = c(1:dim(clids)[1])
clstructure = lapply(unique(clids$id), function(x){clids[clids$id == x,'cdr']})

clstructure[[1]] #get memberships of cluster 1



From this point, I could eventually recreate a distance matrix with only the members of a specific cluster, re-apply hierarchical clustering and start all over again.
But this would take me ages to perform individually for hundreds of clusters. So I was hoping someone could point me in the right direction: how can I take advantage of the initial dendrogram and focus on specific clusters, deriving their sub-clusters at a new given cutoff height?
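
One way to read sub-clusters straight off the initial tree, without re-clustering, is to cut the same hclust object twice and compare the two membership vectors. The sketch below is only an illustration under assumptions: the data are simulated (the points, the labels p1..p30 and the choice of k = 3 and k = 8 are all made up), and k is used in place of the cutoff heights 1.6 and lower, since suitable heights depend on the real data.

```r
# Simulated stand-in for the real data (hypothetical labels p1..p30).
set.seed(6)
pts <- matrix(rnorm(60), ncol = 2, dimnames = list(paste0("p", 1:30), NULL))
z <- hclust(dist(pts), method = "complete")

coarse <- cutree(z, k = 3)  # stands in for the cut at the higher cutoff
fine   <- cutree(z, k = 8)  # stands in for a cut at a lower cutoff

# Rows are coarse clusters, columns are fine clusters. Cuts of one tree are
# nested, so every fine cluster falls inside exactly one coarse cluster.
nesting <- table(coarse, fine)

# Sub-clusters of coarse cluster 1, as a list of member names.
in1  <- coarse == 1
sub1 <- split(names(fine)[in1], fine[in1])
```

With real data, cutree(z, h = ...) at two heights should work the same way, as long as both cuts come from the same tree.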

I recently found in this page http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual

the following code:
clid <- c(1,2)
ysub <- y[names(mycl[mycl%in%clid]),]
hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete") # Select sub-cluster number (here: clid=c(1,2)) and generate corresponding dendrogram.

Even with this example I am afraid I can't work my way around it.
So I guess in my case I could grab all the members of a specific cluster using my existing code and try to reformat the distance matrix into one that only contains the distances between those members:
cluster1members<-clstructure[[1]]

Then I need to reformat the distance matrix into a new one, say d1, which I can feed to a new -local- hierarchical clustering:
hrsub<-hclust(d1, method="complete")

Any ideas on how I can obtain a new distance matrix with just the distances between the members of that cluster, whose names are contained in the vector "cluster1members"?
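
A minimal sketch of the subsetting step asked about here, assuming the distance object carries labels: convert it to a full square matrix, index rows and columns by name, and convert back with as.dist(). The simulated points, the labels p1..p12 and the choice of the largest cluster are assumptions made for the example; `members` plays the role of cluster1members.

```r
# Simulated stand-in for the real distance matrix (hypothetical labels).
set.seed(1)
pts <- matrix(rnorm(24), ncol = 2, dimnames = list(paste0("p", 1:12), NULL))
d <- dist(pts)
clusters <- cutree(hclust(d, method = "complete"), k = 3)

# Take the largest cluster so that there are enough members to re-cluster.
big     <- as.integer(names(which.max(table(clusters))))
members <- names(clusters)[clusters == big]

# Index the square matrix by name on both rows and columns, then convert back.
m  <- as.matrix(d)
d1 <- as.dist(m[members, members, drop = FALSE])

hrsub <- hclust(d1, method = "complete")
```

Indexing both dimensions at once avoids the intermediate transpose; drop = FALSE keeps the result a matrix even if only one member is selected.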

Apologies if this seems trivial, but I really can't find the correct functions to use for this task.
Thank you very much in advance - as I am really a novice with R, small chunks of code as example would be of great help.

Take care all -

Re: Advice on exploration of sub-clusters in hierarchical dendrogram

ilai-2
See inline

On Thu, Feb 23, 2012 at 8:54 AM, kosmo7 <[hidden email]> wrote:
> Dear R user,

> In other words, I am trying to obtain/read the sub-clusters of a specific
> cluster in the dendrogram, by isolating a specific node and exploring
> locally its lower hierarchy.

To explore or "zoom in" on elements of z you had the first step right:
create x<-as.dendrogram(z) but then you didn't use x anymore (except
for the plot which could have been done on z). Maybe you wanted:

> df=read.table('mydata.txt', head=T, row.names=1) #read file with distance
> matrix
> d=as.dist(df) #format table as distance matrix
> z<-hclust(d,method="complete", members=NULL)
> x<-as.dendrogram(z)
> plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the
> dendrogram

From this point:

clusters<-cut(x, h=1.6) #obtain clusters at cutoff height=1.6

# clusters is now (after cut on x, not cutree on z) a list of two components,
# upper and lower. upper is a single dendrogram holding the structure above 1.6;
# lower is a list of dendrograms, one for each local cluster below the cut:

plot(clusters$upper)  # the structure above 1.6
plot(clusters$lower[[1]])  # cluster 1

# To print the details of cluster 1 (this output may be very long,
# depending on how many members it has):

str(clusters$lower[[1]])

To extract specific details from the list and automate for all or some
of the clusters ?dendrapply is your friend.
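
As a rough illustration of automating this over all branches below the cut, the sketch here uses sapply/lapply over the lower list rather than dendrapply, which walks the nodes inside a single tree. Everything about the data is assumed: simulated points, labels p1..p20, and a cutoff h placed between two merge heights so that exactly four branches fall below it.

```r
# Simulated stand-in for the real data (hypothetical labels p1..p20).
set.seed(2)
pts <- matrix(rnorm(40), ncol = 2, dimnames = list(paste0("p", 1:20), NULL))
z <- hclust(dist(pts), method = "complete")
x <- as.dendrogram(z)

# Pick a height between the merges that leave 4 and 3 clusters.
n <- nrow(pts)
h <- mean(z$height[c(n - 4, n - 3)])

parts <- cut(x, h = h)

# One entry per branch below the cut: its size and its member labels.
sizes   <- sapply(parts$lower, function(b) attr(b, "members"))
members <- lapply(parts$lower, labels)
```

Each element of members can then be used to subset other objects by name, for example a full distance matrix.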

I'm assuming your attempts at reclustering locally later in your post
are no longer necessary, unless I'm missing something on what exactly
you are trying to do.

Hope this helps

Elai



______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Advice on exploration of sub-clusters in hierarchical dendrogram

kosmo7
Dear Elai,
thank you very much for your suggestion. I tried cutting the dendrogram instead of the hclust tree with:
clusters<-cut(x, h=1.6)

but then when I try to call/plot cluster 1 for example, with:
plot(clusters$lower[[1]])

I get only 2 members that are joined together at distance=0 (cluster 1, for instance, consists of several hundred members).
So it looks like plot(clusters$lower[[1]]) only shows the very first node of the tree and not the content of the respective cluster [[1]] at the defined cutoff=1.6. Maybe cut instead of cutree doesn't do the job? Or maybe I am just doing something wrong?



In another post I read that with df[value %in% v, ]  I can extract specific subsets of a data frame/table. Maybe I could use this to extract only the distances of members of a specific cluster as defined by cutree from the initial distance matrix? But still, I am afraid I don't get what I should use as value and v....

Re: Advice on exploration of sub-clusters in hierarchical dendrogram

Michael Weylandt
Inline:

On Feb 23, 2012, at 6:20 PM, kosmo7 <[hidden email]> wrote:

> Dear Elai,
> thank you very much for your suggestion. I tried cutting the dendrogram
> instead of the hclust tree with:
> clusters<-cut(x, h=1.6)
>
> but then when I try to call/plot cluster 1 for example, with:
> plot(clusters$lower[[1]])
>
> I get only 2 members that are joined together at distance=0  (cluster 1 for
> instance, consists of several hundred of members).
> So it looks like / plot(clusters$lower[[1]])/ only calls the very first node
> of the tree and not the content of the respective cluster [[1]] at the
> defined cutoff=1.6. Maybe /cut/ instead of /cutree/ doesnt do the work? Or
> maybe I am just doing something  wrong?...
>
>
>
> In another post I read that with /df[value %in% v, ] / I can extract
> specific subsets of a data frame/table.

That was me, and there's a slight mistake in that post (corrected by Sarah). It should be

df[df$value %in% v, ]

Sorry for any confusion that might have caused
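
To make value and v concrete, here is a small sketch with a made-up data frame of cluster memberships: the column cluster plays the role of value, and v holds the cluster numbers to keep. The column names and contents are invented for the example.

```r
# Hypothetical membership table: one row per item, with its cluster number.
df <- data.frame(label   = c("a", "b", "c", "d"),
                 cluster = c(1, 2, 3, 2),
                 stringsAsFactors = FALSE)

v <- c(2, 3)  # the cluster numbers to keep

# Keep the rows whose cluster number appears in v; the comma selects all columns.
sub <- df[df$cluster %in% v, ]
```

The same logical test works on a named vector such as the output of cutree(), e.g. names(clusters)[clusters %in% v].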

Michael


Re: Advice on exploration of sub-clusters in hierarchical dendrogram

kosmo7
In reply to this post by kosmo7
Ok, I was able to work it out finally.
I have been helped many times myself by threads in which other users posted the solution they finally reached, so here is the code that worked for me, for future googlers. It is certainly not optimal, but it works:

# Initial clustering
library(cluster) #provides clusplot() and silhouette()
df=read.table('mydata.txt', header=TRUE, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map

# Local sub-clustering (actually re-clustering on a specific tree node/cluster)

h<-as.matrix(d)  #convert the distance object back to a full square matrix
# Ideally we would work with the initial data table df, but its column labels
# sometimes carry a leading "X" (read.table() adds one to labels that are not
# valid R names, e.g. those starting with a digit), so they may not match the
# names in the cluster vector. The labels of the distance object do not carry
# the "X". This step might not be required, depending on your initial data table format.

clid<-c(1)  #vector with the numbers of the clusters from the initial clustering that you want to pick; separate with commas if more than one. Here we only want cluster 1.
ysub<-h[names(clusters[clusters%in%clid]),]  #keep only the rows of h labelled with a member of cluster 1
ysub<-t(ysub)[names(clusters[clusters%in%clid]),]  #transpose and drop the unneeded rows again, which leaves a square matrix that can be used as a distance matrix
hrsub<-hclust(as.dist(ysub),method="average") #perform your preferred hierarchical method on just the clusters selected with clid
plot(hrsub)
ord2<-cmdscale(ysub, k=2)
plot(ord2) #now we can visually "zoom" in on the data configuration of just the selected cluster by 2D MDS
aa<-silhouette(cutree(hrsub,h=1.2),as.dist(ysub)) #silhouette analysis performed locally on the selected cluster (by clid)
plot(aa)
clusplot(ord2,cutree(hrsub,h=1.2), color=TRUE, shade=TRUE,labels=4, lines=0) #cluster plot of the sub-clusters


Thanks for reading - take care all.

PS. If anyone can write all these things in a more efficient way, please feel free to add a comment.
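
One possible way to avoid repeating the steps above by hand for every cluster is to split the member names by cluster and re-cluster each group in a loop. This is a sketch under assumptions, not a drop-in replacement: the data are simulated, the labels p1..p30 are made up, and k is used instead of the cutoff heights because suitable heights depend on the real distances.

```r
# Simulated stand-in for the real distance matrix (hypothetical labels).
set.seed(3)
pts <- matrix(rnorm(60), ncol = 2, dimnames = list(paste0("p", 1:30), NULL))
d <- dist(pts)
clusters <- cutree(hclust(d, method = "complete"), k = 4)

h <- as.matrix(d)
by_cluster <- split(names(clusters), clusters)

# hclust() needs at least two objects, so singleton clusters are skipped.
by_cluster <- by_cluster[lengths(by_cluster) >= 2]

# Re-cluster each group on its own square sub-matrix of distances.
subtrees <- lapply(by_cluster, function(members) {
  hclust(as.dist(h[members, members]), method = "average")
})

# Sub-clusters of every retained cluster; k = 2 is an arbitrary choice here.
subclusters <- lapply(subtrees, cutree, k = 2)
```

Passing h = 1.2 instead of k = 2 to cutree would mirror the height-based cut used in the post; the result is a list with one membership vector per original cluster.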

Re: Advice on exploration of sub-clusters in hierarchical dendrogram

ilai-2
In reply to this post by Michael Weylandt
Inline:

On Thu, Feb 23, 2012 at 8:23 PM, R. Michael Weylandt
<[hidden email]> wrote:

> Inline:
>
> On Feb 23, 2012, at 6:20 PM, kosmo7 <[hidden email]> wrote:
>
>> Dear Elai,
>> thank you very much for your suggestion. I tried cutting the dendrogram
>> instead of the hclust tree with:
>> clusters<-cut(x, h=1.6)
>>
>> but then when I try to call/plot cluster 1 for example, with:
>> plot(clusters$lower[[1]])
>>
>> I get only 2 members that are joined together at distance=0  (cluster 1 for
>> instance, consists of several hundred of members).
>> So it looks like / plot(clusters$lower[[1]])/ only calls the very first node
>> of the tree and not the content of the respective cluster [[1]] at the
>> defined cutoff=1.6.

The "suggestions" in my original post are just pointers to the fact
there are methods for class dendrogram to achieve what you wanted.
Since you got as far as x<-as.dendrogram(z) I assumed that's all you
needed.

>> Maybe /cut/ instead of /cutree/ doesnt do the work? Or
>> maybe I am just doing something  wrong?...

The examples in ?as.dendrogram and ?dendrapply are self-contained,
very clear and straightforward. If you haven't done so already, I
suggest you try them. Most likely the problem is in your data
(row.names?), in your interpretation of which branch is "cluster 1",
or in the 1.6 cutoff.
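
On the question of which branch is "cluster 1": cutree() numbers clusters by the order in which their members first appear in the data, while the lower list returned by cut() follows the left-to-right order of the dendrogram, so lower[[1]] need not be cutree's cluster 1. The sketch below looks up the cutree number of each branch; the simulated data, the labels and the cutoff are assumptions made for the example.

```r
# Simulated stand-in for the real data (hypothetical labels p1..p20).
set.seed(4)
pts <- matrix(rnorm(40), ncol = 2, dimnames = list(paste0("p", 1:20), NULL))
z <- hclust(dist(pts), method = "complete")

# A height between the merges that leave 4 and 3 clusters.
n <- nrow(pts)
h <- mean(z$height[c(n - 4, n - 3)])

ids   <- cutree(z, h = h)                    # numbered by order in the data
lower <- cut(as.dendrogram(z), h = h)$lower  # ordered left to right in the plot

# For each branch, look up which cutree() cluster its leaves belong to.
branch_id <- sapply(lower, function(b) unique(ids[labels(b)]))
```

The branch holding cutree cluster 1 is then lower[[which(branch_id == 1)]].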

>>
>>
>>
>> In another post I read that with /df[value %in% v, ] / I can extract
>> specific subsets of a data frame/table.
>

Seems I missed some back and forth on this post already, so my
apologies if this is no longer an issue. Personally I find that
because there are many more nodes and info in a tree than rows in the
data set (leaf nodes only) much of the "usual" generic R solutions get
distorted when it comes to trees. Better to use appropriate methods
for the class (dendrapply helps as I've said before).
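
As a small illustration of a class-appropriate method, the sketch below uses dendrapply to visit every node of a dendrogram and colour only the leaf labels by their cutree() cluster. The simulated data, the labels p1..p10, k = 3 and the helper name colour_leaf are all assumptions made for the example.

```r
# Simulated stand-in for the real data (hypothetical labels p1..p10).
set.seed(5)
pts <- matrix(rnorm(20), ncol = 2, dimnames = list(paste0("p", 1:10), NULL))
z <- hclust(dist(pts), method = "complete")
x <- as.dendrogram(z)
ids <- cutree(z, k = 3)

# Hypothetical helper: set the label colour of a leaf to its cluster number;
# inner nodes are returned unchanged.
colour_leaf <- function(node) {
  if (is.leaf(node)) {
    attr(node, "nodePar") <- list(lab.col = ids[[attr(node, "label")]], pch = NA)
  }
  node
}

x_col <- dendrapply(x, colour_leaf)
```

Calling plot(x_col) afterwards should draw the tree with leaf labels coloured by cluster.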

Hope that helps dig you out of the hole.
Elai

