# Trouble with (Very) Simple Clustering

## Trouble with (Very) Simple Clustering

Dear All,

I am doing something extremely basic (and I do not claim at all that there is no other way to achieve the same thing): I have a list of numbers and I would like to split them up into clusters. This is what I do: I treat each number as a 1D vector and calculate the Euclidean distance between them. I get a distance matrix, which I then feed to a hierarchical clustering algorithm. For instance, consider the following snippet

    #########################################################
    data_mat <- structure(c(50.1361524639595, 48.2314746179241, 30.3803078462882,
      29.2679787220381, 25.5125237513957, 22.9052912406594, 21.3890604699407,
      15.5680557012965, 15.322981489303, 8.36693180374788, 7.23530025890675,
      6.51469907237986, 5.42861828441895, 4.61986804112007, 4.33660782487196,
      3.89915821225882, 3.67394875259037, 2.32719820674605, 1.88489249113792,
      1.62276579528843, 1.56048239182126, 1.49722163565454, 1.32492151010636,
      1.28216249552147, 1.272235253501, 0.734274800585336, 0.326949583587343,
      0.318777047947951), .Dim = c(28L, 1L), .Dimnames = list(c("EE", "LV", "RO",
      "BG", "SK", "CY", "LT", "MT", "PL", "NL", "EL", "PT", "CZ", "SE", "UK", "LU",
      "HR", "DK", "AT", "SI", "IE", "ES", "FI", "FR", "DE", "IT", "HU", "BE"), NULL))

    distMatrix <- dist(data_mat)

    n_clus <- 5 ## I arbitrarily choose to have 5 clusters

    hc <- hclust(distMatrix, method = "ward.D2")

    groups <- cutree(hc, k = n_clus) # cut tree into 5 clusters

    pdf("cluster1.pdf")
    plot(hc, labels = , hang = -1, main = "Mobility to Business", yaxt = 'n', ann = FALSE)
    rect.hclust(hc, k = n_clus, border = "red")
    dev.off()
    ######################################################

which gives me very reasonable results. Now, I would like to be able to find the optimal number of clusters on the same data. Based on what I found at

http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
http://www.statmethods.net/advstats/cluster.html

pvclust is a sensible way to go. However, when I try to use it on my data, I get an error

    > fit <- pvclust(t(data_mat), method.hclust="ward.D2", method.dist="euclidean")
    Error in FUN(X[[i]], ...) : invalid scale parameter(r)

Does anybody understand what my mistake is?

Many thanks

Lorenzo

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
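For readers following along, the shapes involved in the two calls above can be checked in base R. This is a minimal sketch, assuming a stand-in matrix `x` of 28 evenly spaced values (not the values from the post); the only fact it relies on is that `dist()` computes distances between the rows of its input.

```r
# Stand-in for data_mat: 28 values in a single column.
# (Hypothetical data, NOT the values from the post.)
x <- matrix(seq(50, 0.5, length.out = 28), ncol = 1)

dim(x)     # 28 rows (one per country), 1 column (the single variable)
dim(t(x))  # after transposing: 1 row, 28 columns

# dist() works on rows, so the untransposed matrix yields one distance
# per pair of countries, while the transposed matrix has a single row
# and therefore no pairs at all.
n_pairs_rows <- length(dist(x))     # number of pairwise distances among 28 rows
n_pairs_t    <- length(dist(t(x)))  # number of pairwise distances among 1 row
```

Comparing `n_pairs_rows` with `n_pairs_t` shows how much information survives the transpose when there is only one variable.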

## Re: Trouble with (Very) Simple Clustering

I think your problem is that pvclust looks for clusters between variables and you have only one variable. When you transpose data_mat, you have a single row and dist cannot calculate a distance matrix on a single row:

    > dist(t(data_mat))
    dist(0)

I was going to suggest package NbClust since there is no need to transpose the data, but it fails as well. I did discover that Mclust() in package mclust works:

    > library(mclust)
    > Mclust(data_mat)
    'Mclust' model object:
     best model: univariate, unequal variance (V) with 3 components

Looking at the density plot suggests 3 groups as well:

    > plot(density(data_mat))

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: R-help [mailto:[hidden email]] On Behalf Of Lorenzo Isella
Sent: Monday, June 6, 2016 10:08 AM
To: [hidden email]
Subject: [R] Trouble with (Very) Simple Clustering
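As a complement to the Mclust() and density-plot suggestions in the reply, a rough base-R check on the hierarchical tree itself is to look for the largest jump between successive merge heights. This is a sketch of a common heuristic, not what pvclust or Mclust do; the helper name `largest_gap_k` and the toy data are hypothetical and were not part of the thread.

```r
# Hypothetical helper (not from the thread): choose k at the largest
# jump between successive merge heights of an hclust tree.
largest_gap_k <- function(hc) {
  h <- rev(hc$height)            # merge heights, last (highest) merge first
  gaps <- h[-length(h)] - h[-1]  # drop in height from one merge to the next
  which.max(gaps) + 1L           # cutting inside gap j leaves j + 1 clusters
}

# Toy stand-in data: three well-separated groups on a line.
# (Illustrative values only, NOT the data from the post.)
toy <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2, 20, 20.1, 20.2), ncol = 1)
hc_toy <- hclust(dist(toy), method = "ward.D2")

k_toy <- largest_gap_k(hc_toy)           # suggested number of clusters
groups_toy <- cutree(hc_toy, k = k_toy)  # cluster membership at that k
table(groups_toy)                        # cluster sizes
```

With the objects from the original post, the same helper could be called as `largest_gap_k(hc)`. It is only a heuristic, so it is worth comparing its answer with the model-based result and the density plot rather than trusting it on its own.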