You really should read the instructions before complaining. The
manual page for kmeans() clearly states that it works on "a
numeric matrix of data." That is not what you provided: you gave
it a distance matrix. The function pam() will work with a
distance matrix if it is properly labeled as such, but
stringdistmatrix(), as called here, returns a plain matrix that
is not labeled as a distance matrix:

dis <- stringdistmatrix(test, test, method = "lv")
dis
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]    0    2    6    6    5    2    3    2    6     5     4     5
 [2,]    2    0    4    5    5    4    4    2    4     3     4     3
 [3,]    6    4    0    1    1    7    7    5    5     3     5     3
 [4,]    6    5    1    0    2    7    8    6    6     4     5     4
 [5,]    5    5    1    2    0    6    7    6    6     4     4     4
 [6,]    2    4    7    7    6    0    1    2    7     5     5     5
 [7,]    3    4    7    8    7    1    0    2    6     5     6     5
 [8,]    2    2    5    6    6    2    2    0    5     4     5     4
 [9,]    6    4    5    6    6    7    6    5    0     2     2     2
[10,]    5    3    3    4    4    5    5    4    2     0     2     0
[11,]    4    4    5    5    4    5    6    5    2     2     0     2
[12,]    5    3    3    4    4    5    5    4    2     0     2     0

library(cluster)  # Works once you have installed it.
# Note: you must tell pam() that this is a distance matrix.
cl <- pam(dis, 4, diss = TRUE)

print(paste(test, "-", cl$clustering))
 [1] "hematolgy - 1"  "hemtology - 1"  "oncology - 2"   "onclogy - 2"
 [5] "oncolgy - 2"    "dermatolgy - 3" "dermatoloy - 3" "dematology - 1"
 [9] "neurolog - 4"   "nerology - 4"   "neurolgy - 4"   "nerology - 4"

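As a minimal sketch of the labeling point: converting the plain matrix with as.dist() gives it class "dist", and pam() then detects a dissimilarity input on its own, because its diss argument defaults to inherits(x, "dist"). The sketch assumes the cluster package is installed, and it uses base R's adist() as a stand-in for stringdistmatrix(test, test, method = "lv") so that it runs without stringdist; with default costs adist() also computes Levenshtein distances. The names m, d, cl_a and cl_b are illustrative only.

```r
library(cluster)  # assumed to be installed (it ships with most R distributions)

test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Stand-in for stringdistmatrix(test, test, method = "lv"):
# adist() with default costs returns Levenshtein distances.
m <- adist(test)
class(m)          # a plain matrix; nothing marks it as distances

d <- as.dist(m)   # now carries class "dist"

cl_a <- pam(d, 4)               # diss is inferred from the "dist" class
cl_b <- pam(m, 4, diss = TRUE)  # explicit flag on the plain matrix

# Group the words by cluster label without an explicit loop.
split(test, cl_a$clustering)
```

Both calls hand pam() the same dissimilarities, so either form should do; the as.dist() route simply makes the intent visible in the object itself.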
The only apparent error is dematology, which is combined with
the hematology variants. But if you look at row 8 of the above
distance matrix, you will see that the Levenshtein distance (the
option you chose) is 2 from dematology to each of hematolgy,
hemtology, dermatolgy, and dermatoloy, so the assignment is a
tie. You may want to choose a distance metric that places
greater weight on the initial letter.
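One way to weight the initial letter, sketched here under stated assumptions rather than offered as a recommended metric: add a fixed penalty to the Levenshtein distance whenever two strings start with different letters. The penalty w = 1 and the names lev, mismatch and d are hypothetical choices for illustration, and adist() again stands in for stringdistmatrix(). The Jaro-Winkler method in stringdist (method = "jw" with a nonzero p) is another option that favors shared prefixes.

```r
library(cluster)  # assumed to be installed

test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

lev <- adist(test)                # Levenshtein distances (base R)
first <- substr(test, 1, 1)       # initial letter of each string

# 1 where the initial letters differ, 0 where they match.
mismatch <- outer(first, first, FUN = "!=") * 1

w <- 1                            # hypothetical penalty; tune as needed
d <- lev + w * mismatch

# "dematology" against hematolgy, hemtology, dermatolgy, dermatoloy:
d[8, c(1, 2, 6, 7)]               # 3 3 2 2, so the tie is broken

cl <- pam(as.dist(d), 4)
split(test, cl$clustering)
```

With the penalty in place, dematology is closer to the dermatology variants than to the hematology ones, which may be enough for pam() to move it; check the resulting clusters rather than assuming so.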

Peer-reviewed research publications, as opposed to idle gossip,
confirm the accuracy of R.

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Peter Langfelder
Sent: Monday, April 28, 2014 11:44 PM
To: cassie jones
Cc: [hidden email]
Subject: Re: [R] Fwd: problem with kmeans

You are using the wrong algorithm. You want Partitioning Around
Medoids (PAM, function pam), not k-means. PAM is also known as
k-medoids, which is where the confusion may come from.

Use

library(cluster)
cl = pam(dis, 4)

and see if you get what you want.

HTH,

Peter

On Mon, Apr 28, 2014 at 9:15 PM, cassie jones <[hidden email]> wrote:

> Dear R-users,
>
> I am trying to run kmeans on a set comprising 100 observations,
> but R somehow cannot figure out the true underlying groups,
> although other software such as JMP and MINITAB produce the
> desired result.

>

> Following is a brief example of what I am doing.

>

> library(stringdist)
> test=c('hematolgy','hemtology','oncology','onclogy',
>        'oncolgy','dermatolgy','dermatoloy','dematology',
>        'neurolog','nerology','neurolgy','nerology')
>
> dis=stringdistmatrix(test,test, method = "lv")
>
> set.seed(123)
> cl=kmeans(dis,4)
>
> grp_cl=vector('list',4)
>
> for(i in 1:4)
> {
>   grp_cl[[i]]=test[which(cl$cluster==i)]
> }
> grp_cl

>

> [[1]]
> [1] "oncology" "onclogy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "oncolgy"
>
> [[4]]
> [1] "hematolgy"  "hemtology"  "dermatolgy" "dermatoloy" "dematology"

>

> In the above example, the 'test' variable consists of a set of
> terms with various typos, and I am trying to group similar
> words based on their string distance. Unfortunately, kmeans is
> not able to replicate the following result, which the other
> software packages are able to produce.

> [[1]]
> [1] "oncology" "onclogy" "oncolgy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "dermatolgy" "dermatoloy" "dematology"
>
> [[4]]
> [1] "hematolgy" "hemtology"

>

>

> Does anyone know if there is a way out? I have heard from a lot
> of people that multivariate analysis in R does not produce the
> desired result most of

