cluster analysis and supervised classification: an alternative to knn1?

cluster analysis and supervised classification: an alternative to knn1?

abanero
Hi,
I have 1,000 observations with 10 attributes of different types (numeric, dichotomous, categorical, etc.) and a measure M.

I need to cluster these observations in order to assign a new observation (with the same 10 attributes but not the measure) to a cluster.

I want to calculate, for the new observation, a measure equal to the average of the measures M of the observations in the assigned cluster.

I would use cluster analysis (the “clara” algorithm?) and then “knn1” (in package class) to assign the new observation to a cluster.

The problem is: I’m not able to use “knn1” because some of the attributes are categorical.

Do you know of something like “knn1” that works with categorical variables too? Do you have any suggestions?

Re: cluster analysis and supervised classification: an alternative to knn1?

Joris FA Meys
Not a direct answer, but from your description it looks like you are better off with supervised classification algorithms instead of unsupervised clustering. See the package randomForest, for example. Alternatively, you can try a logistic regression or a multinomial regression approach, but these are parametric methods and put requirements on the data. randomForest is completely non-parametric.

Cheers
Joris



--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
[hidden email]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: cluster analysis and supervised classification: an alternative to knn1?

Ulrich Bodenhofer
In reply to this post by abanero
abanero wrote:
>
> Do you know  something like “knn1” that works with categorical variables too?
> Do you have any suggestion?
>
There are surely plenty of clustering algorithms around that do not require a vector space structure on the inputs (as KNN does). Agglomerative clustering would solve the problem, and so would kernel-based clustering (assuming that you have a positive semi-definite measure of the similarity of two samples). Probably the simplest way is affinity propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation; see the CRAN package "apcluster", which I have co-developed). All you need is a way of measuring the similarity of samples, which is straightforward both for numerical and categorical variables, as well as for mixtures of both (the choice of the similarity measures and how to aggregate the different variables is left to you, of course). Your final "classification" task can then be accomplished simply by assigning the new sample to the cluster whose exemplar it is most similar to.

Joris Meys wrote:
>
> Not a direct answer, but from your description it looks like you are better
> of with supervised classification algorithms instead of unsupervised
> clustering.
>
If you mean that this is a purely supervised task that can be solved without clustering, I disagree: abanero does not mention any class labels, so it seems to me that unsupervised clustering is indeed necessary first. However, I agree that the second task, assigning new samples to clusters/classes/whatever, can also be solved by almost any supervised technique once the samples have been labeled according to their cluster membership.

Cheers, Ulrich

Re: cluster analysis and supervised classification: an alternative to knn1?

Christian Hennig
Dear abanero,

In principle, k-nearest-neighbour classification can be computed on any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume Euclidean vectors as input, which restricts their use.

I'd probably compute an appropriate dissimilarity between points (have a look at Gower's distance in daisy, package cluster), and then implement nearest-neighbour classification myself if I needed it. It should be pretty straightforward to implement.

If you want unsupervised classification (clustering) instead, you then have the choice between all kinds of dissimilarity-based algorithms (hclust, pam, agnes, etc.).

Christian
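Christian's "implement it yourself" suggestion comes down to a few lines of R. A minimal sketch (the toy data and cluster labels below are made up for illustration):

```r
library(cluster)  # for daisy()

# toy training data with mixed attribute types (illustrative)
train <- data.frame(
  x1 = c(1.0, 2.0, 10.0, 11.0),
  x2 = factor(c("a", "a", "b", "b"))
)
clust <- c(1, 1, 2, 2)  # labels from an earlier clustering step

# new observation with the same attributes and matching factor levels
newobs <- data.frame(x1 = 10.5, x2 = factor("b", levels = levels(train$x2)))

# Gower dissimilarities between the new observation (row 1)
# and all training points (remaining rows)
d <- as.matrix(daisy(rbind(newobs, train), metric = "gower"))[1, -1]

clust[which.min(d)]  # cluster of the nearest neighbour
```

The same Gower dissimilarity matrix could of course also feed hclust() or pam() for the clustering step itself.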

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
[hidden email], www.homepages.ucl.ac.uk/~ucakche

Re: cluster analysis and supervised classification: an alternative to knn1?

abanero
In reply to this post by Ulrich Bodenhofer
Hi,

thank you, Joris and Ulrich, for your answers.

Joris Meys wrote:

>see the library randomForest for example


I'm trying to find examples of randomForest with categorical variables, but I haven't found anything. Do you know of any example with both categorical and numerical variables? Anyway, I don't have any class labels yet. How could I find clusters with randomForest?


Ulrich wrote:

>Probably the simplest way is Affinity Propagation [...] All you need is a way of measuring the similarity of samples which is straightforward both for numerical and categorical variables.

I had a look at the documentation of the package apcluster. That's interesting, but do you have any example using it with both categorical and numerical variables? I'd like to test it with a large dataset.

Thanks a lot!
Cheers

Giuseppe

Re: cluster analysis and supervised classification: an alternative to knn1?

Joris FA Meys
Hi Abanero,

First, I have to correct myself: knn1 is a supervised learning algorithm, so my comment wasn't completely correct. In any case, if you want to do clustering prior to a supervised classification, the function daisy() can handle any kind of variable. The resulting distance matrix can be used with a number of different methods.

And you're right, randomForest doesn't handle categorical variables either. So I haven't been of great help here...
Cheers
Joris


Re: cluster analysis and supervised classification: an alternative to knn1?

Joris FA Meys
I'm confusing myself :-)

randomForest cannot handle character vectors as predictors (which is how I, to my surprise, concluded that a categorical variable could not be used in the function). It CAN handle categorical variables as predictors IF they are put in as factors.

Obviously it also handles a categorical variable as the response.

I hope I'm not going to add more mistakes, it's been enough for the day...
Cheers
Joris
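In other words, the categorical predictor must be coded as a factor before fitting. A small sketch with made-up synthetic data:

```r
library(randomForest)

set.seed(1)
dat <- data.frame(
  num = c(rnorm(50, 0), rnorm(50, 3)),
  # coded as a factor, NOT left as a character vector
  cat = factor(sample(c("red", "blue"), 100, replace = TRUE)),
  y   = factor(rep(c("A", "B"), each = 50))
)

# with 'cat' as a factor, randomForest accepts it as a predictor
fit <- randomForest(y ~ num + cat, data = dat)
head(predict(fit, dat))
```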

On Thu, May 27, 2010 at 2:08 PM, <[hidden email]> wrote:

> Joris,
>
> I've been following this thread for a few days as I am beginning to use
> randomForest in my work.  I am confused by your last email.
>
> What do you mean that randomForest does not handle categorical variables?
>
> It can be used in either regression or classification analysis. Do you
> mean that categorical predictors are not suitable? Certainly they are
> suitable as the response.
> Would you be so kind as to clarify what you were suggesting?
>
> Thanks,
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> [hidden email]
> Office (305) 224 - 4282
> Fax     (305) 224 - 4147



Re: cluster analysis and supervised classification: an alternative to knn1?

Ulrich Bodenhofer
In reply to this post by abanero
>
> I had a look at the documentation of the package apcluster.
> That's interesting but do you have any example using it with both categorical
> and numerical variables? I'd like to test it with a large dataset..
>
Your posting has opened my eyes: problems where both numerical and categorical features occur are probably among the most attractive applications of affinity propagation. So I am considering including such an example in a future release.

Here is a very crude example (download the imports-85.data from http://archive.ics.uci.edu/ml/machine-learning-databases/autos/ first):

> library(cluster)
> library(apcluster)
> automobiles <- read.table("imports-85.data", header=FALSE, sep=",", na.strings="?")
> sim <- -as.matrix(daisy(automobiles))
> apcluster(sim)

The most essential part here is to use daisy() from the package "cluster" for computing distances/similarities. Have a look at the help page of daisy() to get a better impression of how it works and how to tailor the distance/similarity calculations to your needs.

I do not know whether this is a good data set for clustering; affinity propagation produces quite a number of clusters here. Maybe fiddling with the input preferences is necessary (see Section 4 of the vignette of package "apcluster").

Best regards,
Ulrich
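The same pipeline can also be tried without any download, on synthetic mixed-type data (the values below are made up):

```r
library(cluster)
library(apcluster)

set.seed(42)
# two loose groups, one numeric and one categorical attribute
dat <- data.frame(
  num = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  cat = factor(sample(c("a", "b", "c"), 40, replace = TRUE))
)

# daisy() handles the mixed types (Gower); negate distances to get similarities
sim <- -as.matrix(daisy(dat, metric = "gower"))
res <- apcluster(sim)
length(res@clusters)  # number of clusters found
```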


Re: cluster analysis and supervised classification: an alternative to knn1?

Ulrich Bodenhofer
In reply to this post by Joris FA Meys
Sorry, Joris, I overlooked that you had already mentioned daisy() in your posting. I should have credited your recommendation in my previous message.

Cheers, Ulrich

Re: cluster analysis and supervised classification: an alternative to knn1?

abanero
In reply to this post by Ulrich Bodenhofer

Ulrich wrote:
>Affinity propagation produces quite a number of clusters.


I tried with q=0 and it produces 17 clusters. Anyway, that's a good idea, thanks. I'm looking forward to testing it with my dataset.

So I'll probably use daisy() to compute an appropriate dissimilarity, and then apcluster() or another method to determine the clusters.

What do you suggest for assigning a new observation to a given cluster?

It seems that randomForest doesn't work with categorical predictors unless they are coded as factors (thanks to Joris).

Christian wrote:
>and the implement
>nearest neighbours classification myself if I needed it.
>It should be pretty straightforward to implement.

Do you mean modifying the code of the knn1() function yourself?


thanks to everyone!

Re: cluster analysis and supervised classification: an alternative to knn1?

Christian Hennig

> Christian wrote:
>> and the implement
>> nearest neighbours classification myself if I needed it.
>> It should be pretty straightforward to implement.
>
> Do you intend modify the code of the knn1() function by yourself?

No; if you understand what the nearest-neighbours method does, it's not
very complicated to implement it from scratch (assuming that your dataset
is small enough that you don't have to worry too much about optimising
computing times). A bit of programming experience is required, though.
(It's not that I intend to do it right now; I suggest that you do it if
you can...)

Christian


Re: cluster analysis and supervised classification: an alternative to knn1?

Ulrich Bodenhofer
In reply to this post by abanero
>
> What do you suggest in order to assign a new observation to a determined cluster?
>
As I mentioned already, I would simply assign the new observation to the cluster to whose exemplar it is most similar (in a knn1-like fashion). To compute these similarities, you can use the daisy() function. However, you have to do some tricks, since daisy() is designed to compute the square matrix of all mutual distances for a given data set; I did not find another function that is better suited (e.g. one that simply computes the distance between two distinct samples). Maybe others have an idea.

In any case, you have to make sure that the data either remain unscaled, or that you take care yourself that the new observation is scaled with exactly the same parameters that were used for clustering before.

Cheers, Ulrich
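That assignment step can be sketched as a small helper function (names are illustrative: dat is the clustered data frame, exemplars the exemplar row indices, e.g. res@exemplars from an apcluster result):

```r
library(cluster)  # for daisy()

# assign a new observation to the cluster of the most similar exemplar
assignToExemplar <- function(newobs, dat, exemplars) {
  # bind the new observation to the FULL data set so that Gower's
  # per-variable range scaling matches the scaling used when clustering
  d <- as.matrix(daisy(rbind(newobs, dat), metric = "gower"))[1, -1]
  which.min(d[exemplars])  # index of the winning exemplar/cluster
}
```

Binding against the full data set (rather than just the exemplars) is one way to address the scaling caveat, at the cost of recomputing a full dissimilarity row for every new observation.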

Re: cluster analysis and supervised classification: an alternative to knn1?

abanero
Hi Ulrich,
I'm studying the principles of affinity propagation and I'm really glad to use your package (apcluster) to cluster my data. I have just one issue to solve.

If I apply the function apcluster(sim),

where sim is the similarity matrix (negative dissimilarities), I sometimes encounter the warning message:

"Algorithm did not converge. Turn on details
and call plot() to monitor net similarity. Consider
increasing maxits and convits, and, if oscillations occur
also increasing damping factor lam."
 
together with too high a number of clusters.
 
I thought to solve the problem by setting the argument p of apcluster() to mean(preferenceRange(sim)):

apcluster(sim, p=mean(preferenceRange(sim)))

and it actually seems to be a good solution: I don't receive any warning message and the number of clusters is lower.

Do you think it's a good solution? Note that I have to use apcluster() in an automatic procedure, so I can't tune the arguments of the function by hand.

Thanks in advance.
Giuseppe