PAM: how to get the best number of clusters

Maura E Monville
I have a fairly large similarity matrix (2870 x 2870), and I will be
producing even larger ones soon.
I am using PAM to generate clusters.
The desired number of clusters is an input parameter to PAM.
I do not know a priori what the best cluster layout is.
I resorted to the silhouette test, but it takes forever because I have to
run PAM for every possible number of clusters.
I wonder whether there is a faster method, either software or some
theoretical guideline, for finding the optimal number of clusters.

Thank you very much,
--
Maura E.M

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: PAM: how to get the best number of clusters

Dylan Beaudette-2
On Thursday 30 October 2008, Maura E Monville wrote:

> I wonder whether there is some faster method, either a s/w code or some
> theoretical guidelines,
> to get the optimum clusters number.

This is a very general topic in the field of multivariate analysis. There
really isn't any way to know the 'correct' number of clusters; however,
there are several metrics that can give you an indication of how messy your
data are.

For information on the methods in the cluster package, see this book:

Kaufman, L. & Rousseeuw, P. J. (2005). Finding Groups in Data: An
Introduction to Cluster Analysis. Wiley-Interscience.

Otherwise, consider a book on multivariate analysis. Alternatively, try a
hierarchical clustering approach, and look for meaningful groupings.
Something like this:

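library(cluster)  # provides daisy() and diana()

# your_data_matrix is assumed to be a data frame with an 'id' column
d <- diana(daisy(your_data_matrix))  # divisive hierarchical clustering
d.hc <- as.hclust(d)                 # convert for the standard dendrogram plot

d.hc$labels <- your_data_matrix$id   # label the leaves by id

plot(d.hc)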
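
When the dissimilarities are already computed, the same idea might look like the sketch below. It runs on hypothetical toy data (the names toy, toy_diss and tree are illustrative, not from this thread) and relies on diana() accepting a "dist" object directly. Cutting the tree with cutree() at a few candidate numbers of clusters is one way to look for meaningful groupings; it is not a rule for picking the "correct" number.

```r
library(cluster)  # diana()

set.seed(1)
# Hypothetical toy data: two well-separated groups of 10 points in 2-D
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
toy_diss <- dist(toy)  # stand-in for a precomputed dissimilarity object

# diana() treats a "dist" object as dissimilarities, so daisy() is not needed
tree <- as.hclust(diana(toy_diss))

# Cut the same tree at several candidate numbers of clusters
memberships <- cutree(tree, k = 2:4)       # one column per value of k
sizes_k2 <- table(memberships[, "2"])      # group sizes for the 2-cluster cut
print(sizes_k2)
```

The tree is built once; each additional cut is cheap compared with refitting a partitioning method for every k.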

Cheers,

Dylan


--
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341


Re: PAM: how to get the best number of clusters

Maura E Monville
I have the book you mentioned. It basically describes the silhouette method.
I do not have it at hand, as I have moved and it is still in a box. However,
I do not recall the book providing any other criterion for finding the best
number of clusters.
On the other hand, I have the same problem with hierarchical clustering
techniques.
I use clustering as exploratory analysis because I have no a priori
knowledge to help me make a choice.
How can multivariate analysis help?
I launched a loop in which PAM is followed by the silhouette test, with the
number of clusters passed to PAM increased by 1 at each iteration.
I observe that the silhouette value is now oscillating among negative
numbers. Can I assume that it can only get worse once it has turned negative
for the first time? If so, I could leave the loop after the first negative
value and choose the number of clusters associated with the largest positive
silhouette value.
This procedure would save a lot of CPU time.
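
A minimal sketch of such a loop, on hypothetical toy data rather than the real matrix (toy, toy_diss, candidate_k and results are illustrative names). It assumes the input is a dissimilarity object and uses pam(..., diss = TRUE) from the cluster package. The mean silhouette width is recorded for every k in a restricted range so that the whole curve can be inspected, instead of stopping at the first negative value.

```r
library(cluster)  # pam()

set.seed(42)
# Hypothetical toy data: three well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2),
             matrix(rnorm(40, mean = 16), ncol = 2))
toy_diss <- dist(toy)  # stand-in for an externally computed dissimilarity

candidate_k <- 2:8     # a deliberately restricted range of cluster numbers
avg_width <- sapply(candidate_k, function(k) {
  fit <- pam(toy_diss, k = k, diss = TRUE)
  fit$silinfo$avg.width  # mean silhouette width of this solution
})

results <- data.frame(k = candidate_k, avg_width = avg_width)
print(results)
best_k <- results$k[which.max(results$avg_width)]
print(best_k)
```

Each iteration refits PAM from scratch, so the cost grows with the length of candidate_k; narrowing that range is the main lever on run time in this sketch.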

Thank you very much,
Maura

On Thu, Oct 30, 2008 at 7:25 PM, Dylan Beaudette <[hidden email]> wrote:

> This is a very general topic in the field of multivariate analysis. There
> really isn't any way to know the 'correct' number of clusters, however
> there are several metrics that can give you an indication of how messy
> your data are.

--
Maura E.M


Re: PAM: how to get the best number of clusters

Dylan Beaudette-2
On Thursday 30 October 2008, Maura E Monville wrote:

> I launched a loop where the silhouette test follows PAM which is passed a
> clusters number increased by 1 at each iteration.
> Since I am observing that the silhouette value is now oscillating among
> negative numbers, I wonder whether I can assume that it can only grow worse
> once it has turned negative the first time so leave the loop after the
> first negative number and choose the clusters number associated with the
> biggest positive silhouette value.
> This procedure would spare a lot of CPU time.

Another approach might involve stepFlexclust() from the flexclust package.
See the manual page for this function for examples.

Dylan



Re: PAM: how to get the best number of clusters

Maura E Monville
My problem is that I already have a distance (similarity) matrix, generated
outside R by C++ code, because the criteria used to calculate the
"distance" between pairs of objects are not among the standard ones
implemented in R.
If I understand correctly (though I may be mistaken), stepFlexclust()
optimizes the cluster layout by calling either kcca or cclust, which
compute their own similarity matrix.
I just need a function or method to optimize the number of clusters, no
matter how the similarity matrix was generated and no matter which
clustering function I use (PAM in my case).
Is this at all possible?
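
For the first half of that requirement, a sketch of feeding an externally generated matrix to pam() and silhouette() from the cluster package. Everything about the file is hypothetical: a tiny matrix is written to a temporary file here only to stand in for the C++ output. The conversion dissimilarity = 1 - similarity is also an assumption that holds only for similarities scaled to [0, 1]; a different scale would need a different conversion.

```r
library(cluster)  # pam(), silhouette()

# Hypothetical stand-in for the externally generated file: a small symmetric
# similarity matrix with values in [0, 1] and 1 on the diagonal
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0), nrow = 4, byrow = TRUE)
path <- tempfile(fileext = ".txt")
write.table(sim, path, row.names = FALSE, col.names = FALSE)

sim_in <- as.matrix(read.table(path))  # read the matrix back into R
diss <- as.dist(1 - sim_in)            # assumed conversion to dissimilarity

fit <- pam(diss, k = 2, diss = TRUE)   # pam() works from the dissimilarities only
sil <- silhouette(fit)                 # silhouette widths for the same solution
print(fit$clustering)
print(summary(sil)$avg.width)
```

Neither call needs the original objects or the criteria used to compare them, so how the matrix was produced does not matter to this step.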

Thank you very much,
Maura



On Fri, Oct 31, 2008 at 12:18 AM, Dylan Beaudette <[hidden email]> wrote:

> Another approach might involve the stepFlexclust() from the flexclust
> package.
> See the manual page for this function for examples.

--
Maura E.M


Re: PAM: how to get the best number of clusters

droberts
Maura,

    No, in general it is not possible.  Depending on your
goodness-of-clustering metric (and there are many besides silhouette
width), the results may show multiple peaks and be severely
non-monotonic.  Perhaps more problematic, silhouette width is especially
difficult.  It is local, as opposed to global: it only compares an object
to the cluster to which it is assigned and to the most similar other
cluster.  Each object in a cluster may be most similar to a different
cluster, so re-assigning a single object from one cluster to another
requires recalculating the whole mess for all clusters.  In addition,
despite the utility of the mean silhouette width, it is insufficient as a
metric: many different solutions might give the same mean silhouette
width and yet differ greatly in the number of reversals (negative
silhouette widths) or in the variance of within-cluster silhouette
widths.  The beauty of the silhouette width concept is the plot, which
means you have to actually look at the plot of the solution rather than
simply accept a mean silhouette width.
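
As an illustration of looking past the mean, the sketch below fits one PAM solution on hypothetical toy data and then reports the mean silhouette width, the number of reversals, and the per-cluster variance of the widths before drawing the silhouette plot. The variable names are illustrative, and the calls are pam() and silhouette() from the cluster package.

```r
library(cluster)  # pam(), silhouette()

set.seed(7)
# Hypothetical toy data: two overlapping groups, so some objects sit near
# a cluster boundary
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 2), ncol = 2))
toy_diss <- dist(toy)

fit <- pam(toy_diss, k = 3, diss = TRUE)
sil <- silhouette(fit)  # columns: cluster, neighbor, sil_width

mean_width <- mean(sil[, "sil_width"])
n_reversals <- sum(sil[, "sil_width"] < 0)  # objects with negative width
var_by_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], var)

print(c(mean_width = mean_width, n_reversals = n_reversals))
print(var_by_cluster)
plot(sil)               # the silhouette plot of this solution
```

Two solutions with a similar mean_width can differ in n_reversals and var_by_cluster, which is why the plot is worth examining for each candidate.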

    I think you need to consider realistically how many clusters you
want, and limit your consideration to solutions of approximately that
number.

Good luck, Dave R.
- -
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts                                     office 406-994-4548
Professor and Head                                      FAX 406-994-3190
Department of Ecology                         email [hidden email]
Montana State University
Bozeman, MT 59717-3460
