large data set, error: cannot allocate vector

large data set, error: cannot allocate vector

Robert Citek

Why am I getting the error "Error: cannot allocate vector of size  
512000 Kb" on a machine with 6 GB of RAM?

I'm playing with some large data sets within R and doing some simple  
statistics.  The data sets have 10^6 and 10^7 rows of numbers.  R  
reads in and performs summary() on the 10^6 set just fine.  However,  
on the 10^7 set, R halts with the error.  My hunch is that somewhere
there's a setting limiting some memory size to 500 MB.  What setting
is that, can it be increased, and if so how?  Googling for the error
has produced lots of hits but no answers yet.  Still browsing.

Below is a transcript of the session.

Thanks in advance for any pointers in the right direction.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent

-------

$ uname -sorv ; rpm -q R ; R --version
Linux 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 GNU/Linux
R-2.3.0-2.fc4
R version 2.3.0 (2006-04-24)
Copyright (C) 2006 R Development Core Team

$ wc -l dataset.010MM.txt
10000000 dataset.010MM.txt

$ head -3 dataset.010MM.txt
15623
3845
22309

$ wc -l dataset.100MM.txt
100000000 dataset.100MM.txt

$ head -3 dataset.100MM.txt
15623
3845
22309

$ cat ex3.r
options(width=1000)
foo <- read.delim("dataset.010MM.txt")
summary(foo)
foo <- read.delim("dataset.100MM.txt")
summary(foo)

$ R < ex3.r

R > foo <- read.delim("dataset.010MM.txt")

R > summary(foo)
      X15623
Min.   :    1
1st Qu.: 8152
Median :16459
Mean   :16408
3rd Qu.:24618
Max.   :32766

R > foo <- read.delim("dataset.100MM.txt")
Error: cannot allocate vector of size 512000 Kb
Execution halted

$ free -m
              total       used       free     shared    buffers     cached
Mem:          6084       3233       2850          0         20         20
-/+ buffers/cache:       3193       2891
Swap:         2000       2000          0


Re: large data set, error: cannot allocate vector

Uwe Ligges
Robert Citek wrote:

> Why am I getting the error "Error: cannot allocate vector of size  
> 512000 Kb" on a machine with 6 GB of RAM?

1. The message means that R cannot allocate a *further* 512 Mb of RAM
right now for the next step; it tells you neither the total amount
required nor what R is currently consuming.

2. This appears to be a 32-bit OS, which limits the total allocation of
a *single* R process to < 4 Gb (if all goes very well).


> I'm playing with some large data sets within R and doing some simple  
> statistics.  The data sets have 10^6 and 10^7 rows of numbers.  R  

3. 10^7 rows is not large if you have only one column...

4. 10^7 rows need 10 times the memory of 10^6 rows, so comparing 10^6
and 10^7 is quite a difference.
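
For a rough sense of scale (a back-of-the-envelope sketch, assuming
plain numeric storage at 8 bytes per value; a data.frame adds row
names and other overhead on top of this):

## approximate size of the raw numeric values alone, in Mb
## (read.delim also needs working copies while parsing)
1e7 * 8 / 2^20   # ~  76 Mb for 10^7 doubles
1e8 * 8 / 2^20   # ~ 763 Mb for 10^8 doubles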

Uwe Ligges


Re: large data set, error: cannot allocate vector

Jason Barnhart
In reply to this post by Robert Citek
Hello Robert,

?Memory and ?memory.size will be very useful to you in resolving this.

Please also note that the R for Windows FAQ addresses these issues for a
Windows installation:  http://www.stats.ox.ac.uk/pub/R/rw-FAQ.html

Thanks to this list and the link above, I've had success using --max-mem-size
when invoking R.  I'd start with --max-mem-size.

Not sure what OS you are using, but Windows will be more restrictive on
memory (depending on whether you're using a Server edition, etc.).

HTH,
-jason
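
A minimal check of which limits are actually in force on a given
build (mem.limits(), gc() and object.size() are available on all
platforms; memory.limit(), memory.size() and --max-mem-size exist
only on Windows builds):

mem.limits()               # NA NA means no --max-nsize/--max-vsize was given
gc()                       # current and peak Ncells/Vcells usage
object.size(numeric(1e6))  # bytes used by a sample object (~8e6 here)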


Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Robert Citek

Oops.  I was off by an order of magnitude.  I meant 10^7 and 10^8  
rows of data for the first and second data sets, respectively.

On May 5, 2006, at 10:24 AM, Robert Citek wrote:

> R > foo <- read.delim("dataset.010MM.txt")
>
> R > summary(foo)
>       X15623
> Min.   :    1
> 1st Qu.: 8152
> Median :16459
> Mean   :16408
> 3rd Qu.:24618
> Max.   :32766

Reloaded the 10MM set and ran an object.size:

R > object.size(foo)
[1] 440000376

So, 10 MM numbers take about 440 MB.  (Are my units correct?)  That
would explain why 10 MM numbers do work while 100 MM numbers won't:
at roughly 44 bytes per value in the data.frame, 100 MM values would
need over 4 GB, the limit on a 32-bit machine.

From Googling the archives, the solution that I've seen for working
with large data sets seems to be moving to a 64-bit architecture.
Short of that, are there any other generic workarounds, perhaps using
an RDBMS or a CRAN package that enables working with arbitrarily large
data sets?
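
One generic workaround that stays within base R is to process the
file in chunks and keep only running totals in memory.  A minimal
sketch, assuming the one-column layout from the transcript above (the
file name comes from that session; statistics that need the whole
sample at once, like the median, are not covered by this):

con <- file("dataset.100MM.txt", open = "r")
n <- 0; s <- 0; mn <- Inf; mx <- -Inf
repeat {
  x <- scan(con, what = double(), n = 1e6, quiet = TRUE)  # next chunk of values
  if (length(x) == 0) break
  n <- n + length(x); s <- s + sum(x)
  mn <- min(mn, x); mx <- max(mx, x)
}
close(con)
c(n = n, mean = s/n, min = mn, max = mx)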

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Thomas Lumley
In reply to this post by Robert Citek
On Fri, 5 May 2006, Robert Citek wrote:

>
> Why am I getting the error "Error: cannot allocate vector of size
> 512000 Kb" on a machine with 6 GB of RAM?
>

In addition to Uwe's message it is worth pointing out that gc() reports
the maximum memory that your program has used (the rightmost two columns).
You will probably see that this is large.

  -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle


Re: large data set, error: cannot allocate vector

Marc Schwartz (via MN)
In reply to this post by Uwe Ligges
On Fri, 2006-05-05 at 17:56 +0200, Uwe Ligges wrote:

> Robert Citek wrote:
>
> > Why am I getting the error "Error: cannot allocate vector of size  
> > 512000 Kb" on a machine with 6 GB of RAM?
>
> 1. The message means that you cannot allocate *further* 512Mb of RAM
> right now for the next step, but not what is required nor what R is
> currently consuming.
>
> 2. This seems to be a 32-bit OS. It limits the maximal allocation for
> the *single* R process to < 4Gb (if all goes very well).
>
>
> > I'm playing with some large data sets within R and doing some simple  
> > statistics.  The data sets have 10^6 and 10^7 rows of numbers.  R  
>
> 3. 10^7 rows is not large, if you have one column...
>
> 4. 10^7 needs 10 times what is needed for 10^6. Hence comparing 10^6 and
> 10^7 is quite a difference.
>
> Uwe Ligges
>
> > reads in and performs summary() on the 10^6 set just fine.  However,  
> > on the 10^7 set, R halts with the error.  My hunch is that somewhere  
> > there's an setting to limit some memory size to 500 MB.  What setting  
> > is that, can it be increased, and if so how?  Googling for the error  
> > has produced lots of hits but none with answers, yet.  Still browsing.
> >
> > Below is a transcript of the session.
> >
> > Thanks in advance for any pointers in the right direction.
> >
> > Regards,
> > - Robert
> >
> > $ uname -sorv ; rpm -q R ; R --version
> > Linux 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 GNU/Linux

          ^^^^^^^^^^^^^^^^^^^^
<snip>

I might throw out one more pointer in addition to Uwe's comments above,
which _should_ not affect this issue, but as an FYI. Note that I said
"should not" versus "will not".

You are about 17 kernel versions behind. 2.6.11-1.1369_FC4smp was the
original FC4 SMP kernel.

The current FC4 kernel version is 2.6.16-1.2107_FC4smp.

This might suggest that your system requires some substantial updating
generally, which may affect system behavior more broadly.

FC4 was rather unstable when first released and has improved notably
since then. It is one of the reasons that some folks are still running
FC3, even though it is EOL.

The current FC4 kernel release (noted above) has some issues at
present. A new kernel release by Dave Jones, 2111, should be out
"any time now", but in the meantime I would suggest updating your
kernel to version 2.6.16-1.2096smp and doing a full system update
generally.

You may very well find that some behaviors (related and/or unrelated to
this issue) do change for the better.

HTH,

Marc Schwartz


Re: large data set, error: cannot allocate vector

Jason Barnhart
In reply to this post by Robert Citek
I can store 100,000,000 records in about the same space on WinXP,
with --max-mem-size set to 1700M.  I have also successfully stored larger
objects.

Like you, there's not enough space to process a summary, but I've only got
2 GB of RAM.  I've successfully allocated more RAM to R on my Linux box (it
has 4 GB RAM) and processed larger objects.

Have you tried playing w/ the memory settings?

My results are below.
-jason


> tmp<-100000000:200000000
> length(tmp)/1000000
[1] 100
> gc()
           used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells   172832   4.7     350000   9.4    350000    9.4
Vcells 50063180 382.0  120448825 919.0 150074853 1145.0
> object.size(tmp)/length(tmp)
[1] 4
> object.size(tmp)
[1] 4e+08
> print(object.size(tmp)/1024^2,digits=15)
[1] 381.469760894775
> summary(tmp)
Error: cannot allocate vector of size 390625 Kb

Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Jason Barnhart

On May 5, 2006, at 11:28 AM, Jason Barnhart wrote:
> ?Memory

Thanks.  That pointed me to mem.limits(), gc(), and object.size().

> ?memory.size will be very useful to you in resolving this.

That gave me an error:

R > ?memory.size
No documentation for 'memory.size' in specified packages and libraries:
you could try 'help.search("memory.size")'

> Not sure what OS you are using, but Windows will be more  
> restrictive on
> memory (depending on whether you're using a Server edition, etc.

I'm using R-2.3.0-2 under Fedora Core 4/Linux, which doesn't seem to
have any set limits:

R > mem.limits()
nsize vsize
    NA    NA

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Thomas Lumley

On May 5, 2006, at 11:30 AM, Thomas Lumley wrote:
> In addition to Uwe's message it is worth pointing out that gc()  
> reports
> the maximum memory that your program has used (the rightmost two  
> columns).
> You will probably see that this is large.

Reloading the 10 MM dataset:

R > foo <- read.delim("dataset.010MM.txt")

R > object.size(foo)
[1] 440000376

R > gc()
            used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 10183941 272.0   15023450 401.2 10194267 272.3
Vcells 20073146 153.2   53554505 408.6 50086180 382.2

Combined, Ncells and Vcells appear to take up about 700 MB of RAM,
which is about 25% of the 3 GB available under Linux on a 32-bit
architecture.  Also, removing foo seemed to free up "used" memory,
but didn't change the "max used":

R > rm(foo)

R > gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 186694  5.0   12018759 321.0 10194457 272.3
Vcells  74095  0.6   44173915 337.1 50085563 382.2

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Jason Barnhart
Please try memory.limit() to confirm how much system memory is available to
R.

Additionally, read.delim returns a data.frame.  You could use the colClasses
argument to change variable types (see example below) or use scan() which
returns a vector.  This would store the data more compactly.  The vector
object is significantly smaller than the data.frame.

It appears from your example session that you are examining a single
variable.  If so, a vector would suffice.

Note that in the example below, summing large numbers stored as the
integer type produces an integer overflow error.

====================Begin Session====================================
> #create vector
> foovector<-scan(file="temp.txt")
Read 2490368 items

>
> #create data.frame
> foo<-read.delim(file="temp.txt",row.names=NULL,header=FALSE,colClasses=as.vector(c("numeric")))
> attributes(foo)$names<-"myfoo"
>
> foo2<-read.delim(file="temp.txt",row.names=NULL,header=FALSE,colClasses=as.vector(c("integer")))
> attributes(foo2)$names<-"myfoo"
>
> #vector from data.frame
> tmpfoo<-foo$myfoo
>
> #check size
> object.size(foo)
[1] 119538076
> object.size(foo2)
[1] 109576604
> object.size(foovector)
[1] 19922972
> object.size(tmpfoo)
[1] 19922972
>
> #check sums
> sum(tmpfoo)
[1] 2.498528e+13
> sum(foo$myfoo)
[1] 2.498528e+13
> sum(foo2$myfoo)
[1] NA
Warning message:
Integer overflow in sum(.); use sum(as.numeric(.))
> sum(foovector)
[1] 2.498528e+13
>
> #show type
> class(foo2$myfoo)
[1] "integer"
> class(foo$myfoo)
[1] "numeric"
> class(tmpfoo)
[1] "numeric"
> class(foovector)
[1] "numeric"
====================End Session====================================

Re: large data set, error: cannot allocate vector

Prof Brian Ripley
In reply to this post by Robert Citek
On Fri, 5 May 2006, Robert Citek wrote:

>
> On May 5, 2006, at 11:30 AM, Thomas Lumley wrote:
>> In addition to Uwe's message it is worth pointing out that gc()
>> reports
>> the maximum memory that your program has used (the rightmost two
>> columns).
>> You will probably see that this is large.
>
> Reloading the 10 MM dataset:

Ah, but the comment was about the 100 MM dataset, the one which gave you a
problem.

> R > foo <- read.delim("dataset.010MM.txt")
>
> R > object.size(foo)
> [1] 440000376
>
> R > gc()
>            used  (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 10183941 272.0   15023450 401.2 10194267 272.3
> Vcells 20073146 153.2   53554505 408.6 50086180 382.2
>
> Combined, Ncells or Vcells appear to take up about 700 MB of RAM,
> which is about 25% of the 3 GB available under Linux on 32-bit
> architecture.

Please re-read help("Memory-limits").  You have a 3Gb address space, and
are looking for at least one 500Mb chunk.  Fragmentation will come into
play here and it is quite likely that malloc will be unable to find a
500Mb chunk once you have allocated 1Gb.  In attempting to read the
100 MM dataset you probably did go over 1Gb.

> Also, removing foo seemed to free up "used" memory, but didn't change
> the "max used":

Well, it doesn't change history, does it?  You were not expecting removing
objects to increase the memory used, I hope.  From ?gc:

      The final two columns show the maximum space used since the last
      call to 'gc(reset=TRUE)' (or since R started).

> R > rm(foo)
>
> R > gc()
>          used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 186694  5.0   12018759 321.0 10194457 272.3
> Vcells  74095  0.6   44173915 337.1 50085563 382.2

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: large data set, error: cannot allocate vector

Thomas Lumley
In reply to this post by Robert Citek
On Fri, 5 May 2006, Robert Citek wrote:

>
> On May 5, 2006, at 11:30 AM, Thomas Lumley wrote:
>> In addition to Uwe's message it is worth pointing out that gc()
>> reports
>> the maximum memory that your program has used (the rightmost two
>> columns).
>> You will probably see that this is large.
>
> Reloading the 10 MM dataset:
>
> R > foo <- read.delim("dataset.010MM.txt")
>
> R > object.size(foo)
> [1] 440000376
>
> R > gc()
>            used  (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 10183941 272.0   15023450 401.2 10194267 272.3
> Vcells 20073146 153.2   53554505 408.6 50086180 382.2
>
> Combined, Ncells or Vcells appear to take up about 700 MB of RAM,
> which is about 25% of the 3 GB available under Linux on 32-bit
> architecture.  Also, removing foo seemed to free up "used" memory,
> but didn't change the "max used":

No, that's what "max" means.  You need gc(reset=TRUE) to reset the max.

  -thomas


Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle


Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Jason Barnhart

On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
> Please try memory.limit() to confirm how much system memory is  
> available to R.

Unfortunately, memory.limit() is not available:

R > memory.limit()
Error: could not find function "memory.limit"

Did you mean mem.limits()?

R > mem.limits()
nsize vsize
    NA    NA

> Additionally, read.delim returns a data.frame.  You could use the  
> colClasses
> argument to change variable types (see example below) or use scan()  
> which
> returns a vector.  This would store the data more compactly.  The  
> vector
> object is significantly smaller than the data.frame.
>
> It appears from your example session that you are examining a single
> variable.  If so, a vector would suffice.

Yes, a vector worked very nicely (see below).  In fact, using the
vector method R was able to read in the 10 MM entry data set much
faster than as a data.frame.

The reason I have stayed with data.frames is that my "real" data
is of mixed type, much like a database table or spreadsheet.
Unfortunately, my real data set takes too long to work with (~20 MM
entries of mixed type, which take over 20 minutes just to load into
R).  In contrast, the toy data set has about the same number of
entries but only a single column, which captures some of the essence
of my real data set while being a lot faster and easier to work with.
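
For mixed-type files like that, read.delim itself can usually be sped
up quite a bit by declaring the column classes and an upper bound on
the number of rows up front.  A hedged sketch (the file name and the
column classes here are illustrative, not the real data set):

bar <- read.delim("realdata.txt",
                  colClasses = c("character", "integer", "numeric"),
                  nrows = 2.1e7,       # a slight over-estimate is fine
                  comment.char = "")   # turning off comment scanning also helps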

> Note in the example below, processing large numbers in the integer  
> type
> creates an under/over flow error.

Thanks for the examples.  They really help.

Here's a sample transcript from a bash shell under Linux comparing  
some timings using a vector within R:

$ uname -sorv ; rpm -q R ; R --version
Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
R-2.3.0-2.fc4
R version 2.3.0 (2006-04-24)
Copyright (C) 2006 R Development Core Team

$ time -p cat dataset.010MM.txt > /dev/null
real 0.04
user 0.00
sys 0.03

$ time -p cat dataset.100MM.txt > /dev/null
real 7.60
user 0.06
sys 0.67

$ time -p wc -l dataset.100MM.txt
100000000 dataset.100MM.txt
real 2.38
user 1.92
sys 0.44

$ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
   | time -p R -q --no-save

R > foov <- scan("dataset.010MM.txt") ; length(foov)
Read 10000000 items
[1] 10000000

real 9.93
user 9.41
sys 0.52

$ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
   | time -p R -q --no-save

R > foov <- scan("dataset.100MM.txt") ; length(foov)
Read 100000000 items
[1] 100000000

real 92.27
user 88.66
sys 3.58

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Thomas Lumley

On May 8, 2006, at 9:47 AM, Thomas Lumley wrote:

> On Fri, 5 May 2006, Robert Citek wrote:
>> Reloading the 10 MM dataset:
>>
>> R > foo <- read.delim("dataset.010MM.txt")
>>
>> R > object.size(foo)
>> [1] 440000376
>>
>> R > gc()
>>            used  (Mb) gc trigger  (Mb) max used  (Mb)
>> Ncells 10183941 272.0   15023450 401.2 10194267 272.3
>> Vcells 20073146 153.2   53554505 408.6 50086180 382.2
>>
>> Combined, Ncells or Vcells appear to take up about 700 MB of RAM,
>> which is about 25% of the 3 GB available under Linux on 32-bit
>> architecture.  Also, removing foo seemed to free up "used" memory,
>> but didn't change the "max used":
>
> No, that's what "max" means.  You need gc(reset=TRUE) to reset the  
> max.

Yup, that worked (see below).  The example from ?gc wasn't that clear  
to me.  Thanks for clarifying.  I also found it informative to  
compare loading data into a data.frame vs a vector.

$ cat <<eof | R -q --no-save
gc()
foo <- read.delim("dataset.010MM.txt")
gc()
rm(foo)
gc()
gc(reset=TRUE)
eof

R > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 177865  4.8     407500 10.9   350000  9.4
Vcells  72114  0.6     786432  6.0   333941  2.6

R > foo <- read.delim("dataset.010MM.txt")

R > gc()
            used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 10179849 271.9   15023450 401.2 10180159 271.9
Vcells 20072448 153.2   47764583 364.5 46849682 357.5

R > rm(foo)

R > gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 179910  4.9   12018759 321.0 10181187 271.9
Vcells  72458  0.6   38211666 291.6 46849682 357.5

R > gc(reset=TRUE)
          used (Mb) gc trigger  (Mb) max used (Mb)
Ncells 179920  4.9    9615007 256.8   179920  4.9
Vcells  72482  0.6   30569332 233.3    72482  0.6

$ cat <<eof | R -q --no-save
gc()
foo <- scan("dataset.010MM.txt")
gc()
rm(foo)
gc()
gc(reset=TRUE)
eof

R > gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 177865  4.8     407500 10.9   350000  9.4
Vcells  72114  0.6     786432  6.0   333941  2.6

R > foo <- scan("dataset.010MM.txt")
Read 10000000 items

R > gc()
            used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   178230  4.8     407500  10.9   350000   9.4
Vcells 10072185 76.9   26713872 203.9 26456224 201.9

R > rm(foo)

R > gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 178286  4.8     407500  10.9   350000   9.4
Vcells  72190  0.6   21371097 163.1 26456224 201.9

R > gc(reset=TRUE)
          used (Mb) gc trigger  (Mb) max used (Mb)
Ncells 178296  4.8     407500  10.9   178296  4.8
Vcells  72214  0.6   17096877 130.5    72214  0.6

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Jason Barnhart
In reply to this post by Robert Citek
1) So the original problem remains unsolved?  You can load data but lack
memory to do more (or so it appears). It seems to me that your options are:
    a) ensure that the --max-mem-size option is allowing R to utilize all
available RAM
    b) sample if possible, i.e. are 20MM necessary
    c) load in matrices or vectors, then "process" or analyze
    d) load the data into a database that R connects to, and use that
engine for processing (see the sketch after this list)
    e) drop unnecessary columns from data.frame
    f) analyze subsets of the data (variable-wise--review fewer vars at a
time)
    g) buy more RAM (32 vs 64 bit architecture should not be the issue,
since you use LINUX)
    h) ???
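
For option d), a minimal sketch of the database route, assuming the
RSQLite package from CRAN (file, table and column names are
illustrative): the data go into SQLite in chunks, and the aggregation
is done by the database engine rather than in R's memory.

library(RSQLite)
db  <- dbConnect(SQLite(), dbname = "big.db")
fin <- file("dataset.100MM.txt", open = "r")
first <- TRUE
repeat {
  x <- scan(fin, what = double(), n = 1e6, quiet = TRUE)
  if (length(x) == 0) break
  dbWriteTable(db, "big", data.frame(x = x),
               overwrite = first, append = !first)  # create once, then append
  first <- FALSE
}
close(fin)
dbGetQuery(db, "SELECT COUNT(x), AVG(x), MIN(x), MAX(x) FROM big")
dbDisconnect(db)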

2) Not finding memory.limit() is very odd.  You should consider reviewing
the bug reporting process to determine if this should be reported.  Here's
an example of my output.
    > memory.limit()
    [1] 1782579200

3) This may not be the correct way to look at the timing differences you
experienced. However, it seems R is holding up well.

                    10MM  100MM  ratio-100MM/10MM
           cat      0.04   7.60  190.00
          scan      9.93  92.27    9.29
ratio scan/cat    248.25  12.14

Please let me know how you resolve.  I'm curious about your solution
HTH,
-jason



Re: large data set, error: cannot allocate vector

Robert Citek

On May 9, 2006, at 1:32 PM, Jason Barnhart wrote:

> 1) So the original problem remains unsolved?

The question was answered but the problem remains unsolved.  The  
question was, why am I getting an error "cannot allocate vector" when  
reading in a 100 MM integer list.  The answer appears to be:

1) R loads the entire data set into RAM
2) on a 32-bit system R maxes out at 3 GB
3) loading 100 MM integer entries into a data.frame requires more  
than 3 GB of RAM (5-10 GB based on projections from 10 MM entries)

So, the new question is, how does one work around such limits?

> You can load data but lack memory to do more (or so it appears). It  
> seems to me that your options are:
>    a) ensure that the --max-mem-size option is allowing R to  
> utilize all available RAM

--max-mem-size doesn't exist in my version:

$ R --max-mem-size
WARNING: unknown option '--max-mem-size'

Do different versions of R on different OSes and different platforms  
have different options?

FWIW, here's the usage statement from ?mem.limits:

R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max-ppsize=N

>    b) sample if possible, i.e. are 20MM necessary

Yes, or within a factor of 4 of that.

>    c) load in matrices or vectors, then "process" or analyze

Yes, I just need to learn more of the R language to do what I want.

>    d) load data in database that R connects to, use that engine for  
> processing

I have a gut feeling something like this is the way to go.

>    e) drop unnecessary columns from data.frame

Yes.  Currently, one of the fields is an identifier field which is a  
long text field (30+ chars).  That should probably be converted to an  
integer to conserve on both time and space.
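
A minimal sketch of that conversion (the column name 'id' is
hypothetical):

codes    <- factor(foo$id)      # each distinct identifier becomes a level
foo$id   <- as.integer(codes)   # store compact integer codes instead
id.table <- levels(codes)       # keeps the mapping back to the text ids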

>    f) analyze subsets of the data (variable-wise--review fewer vars  
> at a time)

Possibly.

>    g) buy more RAM (32 vs 64 bit architecture should not be the  
> issue, since you use LINUX)

32-bit seems to be the limit.  We've got 6 GB of RAM and 8 GB of
swap.  Despite that, R chokes well before those limits are reached.

>    h) ???

Yes, possibly some other solution we haven't considered.

> 2) Not finding memory.limit() is very odd.  You should consider  
> reviewing the bug reporting process to determine if this should be  
> reported.  Here's an example of my output.
>    > memory.limit()
>    [1] 1782579200

Do different versions of R on different OSes and different platforms  
have different functions?

> 3) This may not be the correct way to look at the timing  
> differences you experienced. However, it seems R is holding up well.
>
>                    10MM  100MM  ratio-100MM/10MM
>           cat      0.04   7.60  190.00
>          scan      9.93  92.27    9.29
> ratio scan/cat    248.25  12.14

I re-ran the timing test for the 100 MM file taking caching into  
account.  Linux with 6 GB has no problem caching the 100 MM file (600  
MB):

                     10MM    100MM  ratio-100MM/10MM
           cat       0.04     0.38    9.50
          scan       9.93    92.27    9.29
ratio scan/cat    248.25   242.82

> Please let me know how you resolve.  I'm curious about your solution
> HTH,

Indeed, very helpful.  I'm learning more about R every day.  Thanks  
for your feedback.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Robert Citek
In reply to this post by Marc Schwartz (via MN)

On May 5, 2006, at 11:54 AM, Marc Schwartz (via MN) wrote:

> On Fri, 2006-05-05 at 17:56 +0200, Uwe Ligges wrote:
>> Robert Citek wrote:
>>> $ uname -sorv ; rpm -q R ; R --version
>>> Linux 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 GNU/
>>> Linux
>
>           ^^^^^^^^^^^^^^^^^^^^
> <snip>
>
> I might throw out one more pointer in addition to Uwe's comments  
> above,
> which _should_ not affect this issue, but as an FYI. Note that I said
> "should not" versus "will not".
>  ...
> The current FC4 kernel release (noted above) has some issues with  
> it at
> present. A new kernel release version by Dave Jones, 2111, should  
> be out
> "any time now", but in the mean time, I would suggest updating your
> kernel to version 2.6.16-1.2096smp and doing a full system update
> generally.
>
> You may very well find that some behaviors (related and/or  
> unrelated to
> this issue) do change for the better.

Thanks, Marc.  Upgraded:

Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux

Ran some simple tests and didn't notice any obvious improvement.  
Nothing obviously broke, either, so that's good.

Regards,
- Robert
http://www.cwelug.org/downloads
Help others get OpenSource software.  Distribute FLOSS
for Windows, Linux, *BSD, and MacOS X with BitTorrent


Re: large data set, error: cannot allocate vector

Marc Schwartz (via MN)
On Tue, 2006-05-09 at 15:35 -0500, Robert Citek wrote:

> Thanks, Marc.  Upgraded:
>
> Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
>
> Ran some simple tests and didn't notice any obvious improvement.
> Nothing obviously broke, either, so that's good.

Robert,

Happy to help.

One quick note: the 2111 version of the kernel was released under that
number for FC5, but as 2108 for FC4. It fixes bugs that cropped up in
the 2107 version I referenced. I did not experience them, but many
others did; they appear to be hardware specific to an extent.

So you can go ahead and upgrade to 2108 at your leisure.

The reason for mentioning the upgrades (not just the kernel, but system
wide) is that you never know what other issues may arise from
changes/updates in compilers, libraries and applications as bugs are
fixed. In addition, and certainly non-trivial, there are the many
security patches that have been implemented since FC4 was released in
June 2005.

One further aside to the thread more generally: you might want to look
at the R-admin manual for some of the pros and cons of moving to 64-bit
hardware, since this was referenced in today's posts about expanding
your RAM limits. It is available online here:

  http://cran.us.r-project.org/doc/manuals/R-admin.html

or as a PDF here:

  http://cran.us.r-project.org/doc/manuals/R-admin.pdf

The URLs above are using a U.S. mirror. Review section 8, "Choosing
between 32- and 64-bit builds".

There have also been many discussions on 32 vs. 64 bit by folks that you
will find in the r-help archives. Using:

  RSiteSearch("64 bit")

will get you almost 900 hits.

HTH,

Marc


Re: large data set, error: cannot allocate vector

Jason Barnhart
In reply to this post by Robert Citek
Robert,

Thanks, I stand corrected on the RAM issue re: 32 vs. 64 bit builds.

As for the --max-mem-size option, I'll try to check my Linux setup at
home tonight.

-jason


Re: large data set, error: cannot allocate vector

Jason Barnhart
OK, I'm a conehead.

There's no memory.limit() on my Linux setup; neither is there a
--max-mem-size option.

Sorry for any false trails.

-jason


>From: "Jason Barnhart" <[hidden email]>
>To: "Robert Citek" <[hidden email]>,
><[hidden email]>
>Subject: Re: [R] large data set, error: cannot allocate vector
>Date: Tue, 9 May 2006 14:32:45 -0700
>
>Robert,
>
>Thanks, I stand corrected on the RAM issue re: 32 vs. 64 bit builds.
>
>As for the --max-memory-size option, I'll try to check my LINUX version at
>home tonight.
>
>-jason
>
>----- Original Message -----
>From: "Robert Citek" <[hidden email]>
>To: <[hidden email]>
>Cc: "Jason Barnhart" <[hidden email]>
>Sent: Tuesday, May 09, 2006 1:27 PM
>Subject: Re: [R] large data set, error: cannot allocate vector
>
>
> >
> > On May 9, 2006, at 1:32 PM, Jason Barnhart wrote:
> >
> >> 1) So the original problem remains unsolved?
> >
> > The question was answered but the problem remains unsolved.  The  
>question
> > was, why am I getting an error "cannot allocate vector" when  reading in
>a
> > 100 MM integer list.  The answer appears to be:
> >
> > 1) R loads the entire data set into RAM
> > 2) on a 32-bit system R max'es out at 3 GB
> > 3) loading 100 MM integer entries into a data.frame requires more  than
>3
> > GB of RAM (5-10 GB based on projections from 10 MM entries)
> >
> > So, the new question is, how does one work around such limits?
> >
> >> You can load data but lack memory to do more (or so it appears). It
> >> seems to me that your options are:
> >>    a) ensure that the --max-mem-size option is allowing R to  utilize
>all
> >> available RAM
> >
> > --max-mem-size doesn't exist in my version:
> >
> > $ R --max-mem-size
> > WARNING: unknown option '--max-mem-size'
> >
> > Do different versions of R on different OSes and different platforms  
>have
> > different options?
> >
> > FWIW, here's the usage statement from ?mem.limits:
> >
> > R --min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu --max-
> > ppsize=N
> >
> >>    b) sample if possible, i.e. are 20MM necessary
> >
> > Yes, or within a factor of 4 of that.
> >
> >>    c) load in matrices or vectors, then "process" or analyze
> >
> > Yes, I just need to learn more of the R language to do what I want.
> >
> >>    d) load data in database that R connects to, use that engine for
> >> processing
> >
> > I have a gut feeling something like this is the way to go.
> >
> >>    e) drop unnecessary columns from data.frame
> >
> > Yes.  Currently, one of the fields is an identifier field which is a  
>long
> > text field (30+ chars).  That should probably be converted to an  
>integer
> > to conserve on both time and space.
> >
> >>    f) analyze subsets of the data (variable-wise--review fewer vars  at
>a
> >> time)
> >
> > Possibly.
> >
> >>    g) buy more RAM (32 vs 64 bit architecture should not be the  issue,
> >> since you use LINUX)
> >
> > 32-bit seems to be the limit.  We've got 6 GB of RAM and 8 GB of  swap.
> > Despite that R chokes well before those limits are reached.
> >
> >>    h) ???
> >
> > Yes, possibly some other solution we haven't considered.
> >
> >> 2) Not finding memory.limit() is very odd.  You should consider
> >> reviewing the bug reporting process to determine if this should be
> >> reported.  Here's an example of my output.
> >>    > memory.limit()
> >>    [1] 1782579200
> >
> > Do different versions of R on different OSes and different platforms  
>have
> > different functions?
> >
> >> 3) This may not be the correct way to look at the timing  differences
>you
> >> experienced. However, it seems R is holding up well.
> >>
> >>                    10MM  100MM  ratio-100MM/10MM
> >>           cat      0.04   7.60  190.00
> >>          scan      9.93  92.27    9.29
> >> ratio scan/cat    248.25  12.14
> >
> > I re-ran the timing test for the 100 MM file taking caching into  
>account.
> > Linux with 6 GB has no problem caching the 100 MM file (600  MB):
> >
> >                     10MM    100MM  ratio-100MM/10MM
> >           cat       0.04     0.38    9.50
> >          scan       9.93    92.27    9.29
> > ratio scan/cat    248.25   242.82
> >
> >> Please let me know how you resolve.  I'm curious about your solution
> >> HTH,
> >
> > Indeed, very helpful.  I'm learning more about R every day.  Thanks  for
> > your feedback.
> >
> > Regards,
> > - Robert
> > http://www.cwelug.org/downloads
> > Help others get OpenSource software.  Distribute FLOSS
> > for Windows, Linux, *BSD, and MacOS X with BitTorrent
> >
> >
>
>______________________________________________
>[hidden email] mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide!
>http://www.R-project.org/posting-guide.html

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
12