

Hi All,
I was using the following command to run kmeans on the Iris dataset.
Kmeans_model <- kmeans(dataFrame[,c(1,2,3,4)],centers=3)
This was giving proper results for me. However, in our application we generate the R commands dynamically, and there is a requirement that column names, rather than column indices, be sent to the R commands. To incorporate this, I tried writing the commands in the following ways.
kmeans_model <- kmeans((SepalLength+SepalWidth+PetalLength+PetalWidth),centers=3)
or
kmeans_model <- kmeans(as.matrix(SepalLength,SepalWidth,PetalLength,PetalWidth),centers=3)
In both cases, the results differ from what we saw with the first command (with column indices).
Can you please let us know what is going wrong here? How can column names be used in kmeans to obtain the correct results?
Many thanks,
Raji


I'm not going to comment on column names, but this is just to make you
aware that the results of kmeans depend on random initialisation.
This means that it is possible that you get different results if you run
it several times. It basically gives you a local optimum and there may be
more than one of these.
Use set.seed to see whether this explains your problem.
Best regards,
Christian
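
To make the point about random initialisation concrete, here is a minimal sketch. It assumes R's built-in iris data can stand in for the poster's dataFrame (the original CSV is not available), and the seed value and nstart = 25 are arbitrary choices for illustration.

```r
# Assumption: the built-in iris data stands in for the poster's dataFrame.
x <- iris[, 1:4]

# Same seed and same input give the same starting centres,
# so the two partitions are identical.
set.seed(1234)
fit_a <- kmeans(x, centers = 3)
set.seed(1234)
fit_b <- kmeans(x, centers = 3)
identical(fit_a$cluster, fit_b$cluster)

# Several random starts; kmeans keeps the solution with the smallest
# total within-cluster sum of squares, which reduces the dependence
# on any single initialisation.
set.seed(1234)
fit_multi <- kmeans(x, centers = 3, nstart = 25)
fit_multi$tot.withinss
```

If two calls still disagree after fixing the seed, the difference probably lies in the input data rather than in the initialisation.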
On Wed, 6 Apr 2011, Raji wrote:
> Can you please let us know how the column names can be used in kmeans to
> obtain the correct results?
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[hidden email], www.homepages.ucl.ac.uk/~ucakche
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Hi,
Thanks for the information. But I am already using set.seed(). My problem
is that when I use column names instead of column indices, the result is
consistently less accurate. Hence, we wanted to understand how kmeans
differentiates between column names and column indices. Is there any way we
can bridge the gap so that we get the same result for column names and
column indices?
Regards,
Raji
On Wed, Apr 6, 2011 at 5:30 PM, Christian Hennig <[hidden email]> wrote:
> I'm not going to comment on column names, but this is just to make you
> aware that the results of kmeans depend on random initialisation.
> Use set.seed to see whether this explains your problem.


Hi,
I have herewith attached the results of the 2 commands.
> set.seed(1234)
> kmeans_model <- kmeans((SepalLength+SepalWidth+PetalLength+PetalWidth),centers=3)
> kmeans_model$cluster
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2
[101] 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2
3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
> kmeansM <- kmeans(dataFrame[,c(1,2,3,4)],centers=3)
> kmeansM$cluster
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
38 39 40 41 42 43 44 45 46 47 48 49 50 51
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 3
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99 100 101 102
3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 2 3
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
141 142 143 144 145 146 147 148 149 150
2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3
2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3
2 2 2 3 2 2 2 3 2 2 3
We can see that the first result is less accurate than the second.
Can you please let me know how I can make the first command, with
column names, give the result of the second one?
Many thanks.
Regards,
Raji
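
The thread does not say why the two name-based calls behave differently. A likely reading is that `a + b + c + d` adds the columns element-wise into a single vector, and that `as.matrix()` uses only its first argument, so in both cases kmeans sees one column instead of four. The sketch below illustrates this under the assumption that R's built-in iris data (whose columns are named Sepal.Length etc., not SepalLength) stands in for the poster's dataFrame. Selecting columns with a character vector of names is one way to pass the same input as with indices.

```r
# Assumption: built-in iris stands in for dataFrame; adjust the names
# in `cols` to match the headers of the actual CSV file.
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# "+" adds the columns element-wise: one vector of 150 sums.
summed <- with(iris, Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
length(summed)
is.null(dim(summed))

# as.matrix() converts its first argument only; the other three
# columns are silently ignored, leaving a 150 x 1 matrix.
first_only <- with(iris,
                   as.matrix(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))
dim(first_only)

# Selecting by name yields the same data frame as selecting by index.
by_name  <- iris[, cols]
by_index <- iris[, c(1, 2, 3, 4)]
identical(by_name, by_index)

# With the same seed, the two kmeans fits therefore agree.
set.seed(1234)
fit_name <- kmeans(by_name, centers = 3)
set.seed(1234)
fit_index <- kmeans(by_index, centers = 3)
identical(fit_name$cluster, fit_index$cluster)
```

For dynamically generated commands, the column names can be passed as a character vector such as `cols` above, which avoids attach() and bare column names altogether.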
On Thu, Apr 7, 2011 at 4:02 AM, raji sankaran <[hidden email]> wrote:
> Thanks for the information. But I am already using set.seed(). My problem
> is that when I use column names instead of column indices, the result seems
> to be less accurate consistently.


Hi All,
For executing kmeans on Iris, we found that there were 2 different ways.
dataFrame <- read.csv("c:/Iris.csv",header=T)
1. kmeans_model <- kmeans(dataFrame[1:5],size=3)
This gave an error, as one of the inputs was Species, which is a String column.
2. attach(dataFrame)
kmeans_model <- kmeans(cbind(SepalLength,SepalWidth,PetalLength,PetalWidth,Species),3)
But this command worked and gave output.
Does this mean that kmeans can also accept String inputs?
Can you please let me know how the second command works?
Thanks in advance.
Regards,
Raji


Hi,
I suspect your column Species is of class "factor" (as it is in R's built-in
iris dataset).
This means that in your case Species is an integer vector with the level
names attached as additional information. kmeans internally calls
as.matrix(), which turns your data frame into a character matrix because one
column is a factor, and you get an error.
After binding the columns with cbind, the result is a numeric matrix in which
the Species column holds the internal level codes (1, 2 and 3 instead of
"setosa", "versicolor", "virginica"), and kmeans no longer throws an error.
Furthermore, kmeans wouldn't work in the first case because there is no
"size=" argument in kmeans. You probably meant centers=3.
For additional information, try ?kmeans
Christoph
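
The difference between the two conversions can be checked directly. This is a small sketch assuming R's built-in iris data, where Species is a factor; a CSV read with a recent R version may instead give a character column, in which case cbind would behave differently.

```r
# Assumption: built-in iris, where Species is a factor.

# as.matrix() on a data frame with a factor column gives a character matrix.
m_char <- as.matrix(iris[1:5])
typeof(m_char)

# cbind() on the individual columns keeps the numbers and replaces the
# factor by its internal integer codes.
m_num <- with(iris,
              cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species))
typeof(m_num)
sort(unique(m_num[, "Species"]))

# kmeans() has no "size" argument, so this call fails regardless of the data.
err <- tryCatch(kmeans(iris[1:4], size = 3),
                error = function(e) conditionMessage(e))
err
```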
2011/10/16 Raji <[hidden email]>
> Does this mean that kmeans can accept String inputs also?
> Can you please let me know how the second command works?


Hi,
Thank you. The information was very helpful.
Yes, it was meant to be centers=3. Even with that, kmeans gives an error if we
pass the index of the Species column.
So, is it OK to use kmeans for String data by using cbind? kmeans works even
if we give it a column that contains distinct String values, for example a
column of names such as country names. How does this work in such cases? Is
it expected behavior?
Country
-------
England
Germany
China
Thanks,
Raji
On Mon, Oct 17, 2011 at 1:02 AM, Christoph Molnar <[hidden email]> wrote:
> Furthermore kmeans wouldn't work in the first case, because there is no
> "size=" argument in kmeans. You probably meant centers=3.


Hi,
no, don't use kmeans with factors.
The kmeans algorithm, among other things, calculates the mean of each of the
k clusters.
But you don't get a useful mean from factors, because the integers used
internally are arbitrary. In this case they are 1, 2 and 3, but they could
just as well be 42, 7 and 100000, which would change any calculation of a mean.
That's why the kmeans() function wants numeric matrices.
Maybe you should think about how kmeans works:
http://en.wikipedia.org/wiki/K-means_clustering
Christoph
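
The arbitrariness of the codes is easy to see on a toy example. The sketch below uses a hypothetical four-row country column modelled on the one in the question. It shows that relabelling the levels changes the integer codes, and hence any mean computed from them, while the data stay the same.

```r
# Hypothetical country column, modelled on the example in the question.
country <- factor(c("England", "Germany", "China", "Germany"))

# By default the levels are sorted alphabetically: China = 1,
# England = 2, Germany = 3.
as.integer(country)
mean(as.integer(country))

# The same data with the levels listed in a different order get
# different codes, so the "mean country" changes.
relabelled <- factor(country, levels = c("Germany", "England", "China"))
as.integer(relabelled)
mean(as.integer(relabelled))
```

Since kmeans relies on such means and on distances between the codes, the resulting clusters would depend on the level order rather than on anything in the data.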
2011/10/16 raji sankaran <[hidden email]>
> So, is it ok to use kmeans for String data by using cbind?
> For example, a column which contains names like country names. How does this
> work in such cases? Is it expected behavior?

