cluster analysis: "error in vector("double", length): given vector size is too big {Fehler in vector("double", length) : angegebene Vektorgröße ist zu groß}

Markus Preisetanz
Dear R Specialists,

When trying to cluster a data.frame with about 80,000 rows and 25 columns I get the above error message. I tried hclust (using dist), agnes (passing the data.frame directly) and pam (passing the data.frame directly). What I actually want to avoid is drawing a random sample from the data.

The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM.

Does anybody know what to do?
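
[For reference, a minimal sketch of the failing calls; the data.frame name df and k = 5 are placeholders for illustration, not from the original post:

  library(cluster)
  d  <- dist(df)        # already fails: the dist object is too large
  hc <- hclust(d)
  ag <- agnes(df)       # computes the same dissimilarities internally
  pm <- pam(df, k = 5)  # likewise
]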

 

Sincerely
___________________
Markus Preisetanz
Consultant

Client Vela GmbH
Albert-Roßhaupter-Str. 32
81369 München
phone: +49 (0) 89 742 17-113
fax:   +49 (0) 89 742 17-150
mailto:[hidden email]



RE: cluster analysis: "Error in vector("double", length) : given vector size is too big"

Liaw, Andy
Let's do some simple arithmetic: the dist object from a data set with
80000 cases would have

  80000 * (80000 - 1) / 2

elements, each taking 8 bytes to store in double precision.  That's
roughly 25.6 GB (about 24 GiB).  You'd have a devil of a time doing
this even on a 64-bit machine with 32 GB of RAM, let alone on the
machine you are using.  You'd have a much better chance sticking with
algorithms that do not require storing the full (dis)similarity
matrix.
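
A quick check of that arithmetic in R (a sketch; object.size() on a
real dist object would add a small constant overhead):

  n <- 80000
  n * (n - 1) / 2             # 3,199,960,000 dissimilarities
  n * (n - 1) / 2 * 8 / 2^30  # ~23.8 GiB of doubles alone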

Andy

From: Markus Preisetanz
> when trying to cluster a data.frame with about 80,000 rows
> and 25 columns I get the above error message. [...]

Re: cluster analysis for 80000 observations

Martin Maechler
In reply to this post by Markus Preisetanz
>>>>> "Markus" == Markus Preisetanz <[hidden email]>
>>>>>     on Thu, 26 Jan 2006 20:48:29 +0100 writes:

    Markus> Dear R Specialists,
    Markus> when trying to cluster a data.frame with about 80,000 rows
    Markus> and 25 columns I get the above error message. I tried
    Markus> hclust (using dist), agnes (passing the data.frame
    Markus> directly) and pam (passing the data.frame directly). What
    Markus> I actually want to avoid is drawing a random sample from
    Markus> the data.

Currently, all the above-mentioned cluster methods work with full
distance / dissimilarity objects, even if only internally, i.e., they
store all d_{i,j} for 1 <= i < j <= n, that is, n(n-1)/2 values, each
stored in double precision (8 bytes).

So: no chance with the above functions and n = 80'000.
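
(An order-of-magnitude check in R, ignoring R's copying and other
overhead: the largest n whose full dissimilarity object fits in m
bytes satisfies n(n-1)/2 * 8 <= m, i.e. roughly n <= sqrt(m/4).)

  sqrt(2 * 2^31 / 8)  # ~23170 cases for 2 GB, as an upper bound;
                      # in practice far fewer, since R makes copies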

    Markus> The machine I run R on is a Windows 2000 Server (Pentium
    Markus> 4) with 2 GB of RAM.

If you were running a machine with a 64-bit OS and a 64-bit build of R
{the typical case today: Linux on AMD Opteron}, you could go quite a
bit higher than on your Windoze box {I vaguely remember I could do
'n = a few thousand' on our dual Opteron with 16 GB}, but 80'000 is
definitely too large.

OTOH, there is clara() in the cluster package, which has been designed
for exactly such situations:

         CLARA := [C]lustering [LAR]ge [A]pplications

It is similar in spirit to pam(): it *does* cluster all 80'000
observations, but does so by drawing subsamples to construct the
medoids.  (And you can ask it to take many medium-sized subsamples
instead of just the five small ones it takes by default; see the
sketch below.)
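
A minimal sketch of such a call (df and k = 5 are placeholders; tune
samples and sampsize to your data and the RAM you have):

  library(cluster)
  ## memory use is governed by sampsize, not by nrow(df):
  ## many medium-sized subsamples instead of the default 5 small ones
  cl <- clara(df, k = 5, samples = 50, sampsize = 1000)
  table(cl$clustering)  # cluster assignment for all 80'000 rows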

Martin Maechler, ETH Zurich
maintainer of "cluster" package.
