Fwd: problem with kmeans

Fwd: problem with kmeans

cassie jones
Dear R-users,

I am trying to run kmeans on a set of 100 observations, but R
somehow cannot recover the true underlying groups, although other
software such as JMP and Minitab produces the desired result.

Following is a brief example of what I am doing.

library(stringdist)
test=c('hematolgy','hemtology','oncology','onclogy',
'oncolgy','dermatolgy','dermatoloy','dematology',
'neurolog','nerology','neurolgy','nerology')

dis=stringdistmatrix(test,test, method = "lv")

set.seed(123)
cl=kmeans(dis,4)


grp_cl=vector('list',4)

for(i in 1:4)
{
    grp_cl[[i]]=test[which(cl$cluster==i)]
}
grp_cl

[[1]]
[1] "oncology" "onclogy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "oncolgy"

[[4]]
[1] "hematolgy"  "hemtology"  "dermatolgy" "dermatoloy" "dematology"

In the above example, the 'test' variable is a set of terms with
various typos, and I am trying to group similar words based on their
string distance. Unfortunately, kmeans does not reproduce the following
result, which the other software does produce:
[[1]]
[1] "oncology" "onclogy"  "oncolgy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "dermatolgy" "dermatoloy" "dematology"

[[4]]
[1] "hematolgy"  "hemtology"


Does anyone know if there is a way out? I have heard from a lot of people
that multivariate analysis in R does not produce the desired result most of
the time. Any help is really appreciated.


Thanks in advance.


Cassie

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Fwd: problem with kmeans

Ranjan Maitra-2
Cassie,

I am sorry, but do you even know what k-means does? It is a locally
optimal algorithm, and different software packages implement the same
algorithm differently.

FYI, R uses the Hartigan-Wong (1979) algorithm by default, which is
probably the most efficient out there.

I suggest you first take a multivariate statistics class before
making such sweeping statements. (Btw, did these same "some people"
tell you that most other software does not provide the broad range of
abilities that R provides, and is therefore not even comparable?)

And then, please read the help page (?kmeans) for how to "improve" your
run of k-means using R.

HTH,
Ranjan
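
To illustrate the point about local optima, here is a minimal sketch. It assumes base R's adist() as a stand-in for stringdistmatrix(..., method = "lv") (with default costs it also returns plain Levenshtein distances) and uses the nstart argument documented in ?kmeans. The seeds and the value nstart = 50 are arbitrary choices for illustration.

```r
# Same words as in the original post.
test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() (utils) stands in for stringdistmatrix(method = "lv").
dis <- adist(test)

# One random start per seed: the objective can differ from run to run,
# because each run only reaches a local optimum.
single <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(dis, centers = 4)$tot.withinss
})
range(single)

# Many random starts in one call: kmeans() keeps the best of them.
set.seed(123)
multi <- kmeans(dis, centers = 4, nstart = 50)
multi$tot.withinss
split(test, multi$cluster)
```

This only addresses sensitivity to the starting centers. It still feeds kmeans() the distance matrix as if it were raw data, which later replies in the thread take issue with.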



Re: Fwd: problem with kmeans

plangfelder
In reply to this post by cassie jones
You are using the wrong algorithm. You want Partitioning around
Medoids (PAM, function pam), not k-means. PAM is also known as
k-medoids, which is where the confusion may come from.

Use:

library(cluster)

cl = pam(dis, 4)

and see if you get what you want.

HTH,

Peter
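
To see what distinguishes k-medoids from k-means here, a minimal sketch follows. It assumes adist() from base R as a stand-in for stringdistmatrix(..., method = "lv"), and converts the result with as.dist() so that pam() treats it as dissimilarities. Each PAM cluster is represented by a medoid, which is one of the actual input words, whereas a k-means center is an average of rows.

```r
library(cluster)

# Same words as in the original post.
test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() (utils) stands in for stringdistmatrix(method = "lv").
d <- as.dist(adist(test))

fit <- pam(d, k = 4)

# id.med holds the indices of the medoids, so these are real input words.
test[fit$id.med]

# Cluster membership for every word.
split(test, fit$clustering)
```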




Re: Fwd: problem with kmeans

David Carlson
You really should read the instructions before complaining. The
manual page for kmeans clearly states that it works on "a
numeric matrix of data." That is not what you provided. You gave
it a distance matrix. The function pam() will work with a
distance matrix if it is properly labeled as such, but
stringdistmatrix() does not label the output as a distance
matrix:

dis <- stringdistmatrix(test, test, method = "lv")
dis
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]    0    2    6    6    5    2    3    2    6     5     4     5
 [2,]    2    0    4    5    5    4    4    2    4     3     4     3
 [3,]    6    4    0    1    1    7    7    5    5     3     5     3
 [4,]    6    5    1    0    2    7    8    6    6     4     5     4
 [5,]    5    5    1    2    0    6    7    6    6     4     4     4
 [6,]    2    4    7    7    6    0    1    2    7     5     5     5
 [7,]    3    4    7    8    7    1    0    2    6     5     6     5
 [8,]    2    2    5    6    6    2    2    0    5     4     5     4
 [9,]    6    4    5    6    6    7    6    5    0     2     2     2
[10,]    5    3    3    4    4    5    5    4    2     0     2     0
[11,]    4    4    5    5    4    5    6    5    2     2     0     2
[12,]    5    3    3    4    4    5    5    4    2     0     2     0

require(cluster) # Works once you have installed it.

# Note you must tell pam() that this is a distance matrix.
cl <- pam(dis, 4, diss=TRUE)

print(paste(test, "-", cl$clustering))
 [1] "hematolgy - 1"  "hemtology - 1"  "oncology - 2"   "onclogy - 2"
 [5] "oncolgy - 2"    "dermatolgy - 3" "dermatoloy - 3" "dematology - 1"
 [9] "neurolog - 4"   "nerology - 4"   "neurolgy - 4"   "nerology - 4"
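
The difference between passing a plain matrix and a labeled distance object can be sketched as follows, assuming adist() from base R as a stand-in for stringdistmatrix(..., method = "lv"). as.dist() attaches the "dist" class, which pam() recognizes without diss = TRUE. A plain matrix passed without diss = TRUE is instead treated as 12 observations on 12 variables.

```r
library(cluster)

# Same words as in the original post.
test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() (utils) stands in for stringdistmatrix(method = "lv").
dis <- adist(test)
class(dis)        # a plain matrix; nothing marks it as distances

d <- as.dist(dis)
class(d)          # now carries the "dist" class

by_flag  <- pam(dis, k = 4, diss = TRUE)  # told explicitly via diss = TRUE
by_class <- pam(d, k = 4)                 # inferred from the "dist" class
as_data  <- pam(dis, k = 4)               # rows treated as raw observations

# Cross-tabulate the assignments to compare the three calls.
table(by_flag = by_flag$clustering, by_class = by_class$clustering)
table(by_flag = by_flag$clustering, as_data = as_data$clustering)
```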

The only apparent error is 'dematology', which is grouped with the
'hematolgy' words. But if you look at row 8 of the above distance
matrix, you will see that the Levenshtein distance (the option
you chose) has the value 2 for 'hematolgy', 'hemtology',
'dermatolgy', and 'dermatoloy' alike. You may want to choose a distance
metric that places greater weight on the initial letter.
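
One way to act on that suggestion, as a minimal sketch rather than a recommended metric, is to add a fixed penalty to the Levenshtein distance whenever two words start with different letters. The penalty of 2 is a hypothetical value chosen for illustration, and adist() from base R is assumed as a stand-in for stringdistmatrix(..., method = "lv").

```r
library(cluster)

# Same words as in the original post.
test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() (utils) stands in for stringdistmatrix(method = "lv").
lev <- adist(test)

# Hypothetical weighting: add 2 wherever the initial letters differ.
first   <- substr(test, 1, 1)
penalty <- 2 * outer(first, first, "!=")
dis_w   <- lev + penalty

# 'dematology' (row 8) versus 'hematolgy' (col 1) and 'dermatolgy' (col 6),
# before and after the penalty.
lev[8, c(1, 6)]
dis_w[8, c(1, 6)]

# Cluster on the weighted dissimilarities.
cl_w <- pam(as.dist(dis_w), k = 4)
split(test, cl_w$clustering)
```

The penalty keeps the matrix symmetric with a zero diagonal, so it remains a valid input for pam(). Whether 2 is a sensible weight depends on the data.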

Peer reviewed research publications, as opposed to idle gossip,
confirm the accuracy of R.

