R crashes when using huge data sets with character string variables

R crashes when using huge data sets with character string variables

Arne Henningsen-3
When working with a huge data set with character string variables, I
found that various commands cause R to crash. When I run R in a
Linux/bash console, R terminates with the message "Killed". When I use
RStudio, I get the message "R Session Aborted. R encountered a fatal
error. The session was terminated. Start New Session". If an object in
the R workspace needs too much memory, I would expect R not to crash
but to issue an error message such as "Error: cannot allocate vector of
size ...". A minimal reproducible example (at least on my computer)
is:

nObs <- 1e9

date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,
1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )

Is this a bug or a feature of R?

Some information about my R version, OS, etc:

R> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
[5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
[7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.3

/Arne

--
Arne Henningsen
http://www.arne-henningsen.name

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: R crashes when using huge data sets with character string variables

bbolker
On Windows you can use memory.limit(). For Linux, see:

https://stackoverflow.com/questions/12582793/limiting-memory-usage-in-r-under-linux

Not sure how much that helps.



Re: R crashes when using huge data sets with character string variables

R devel mailing list

On OS X I see:

    > nObs <- 1e9
    >  date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )
    Error: vector memory exhausted (limit reached?)
    > sessionInfo()
    R version 4.0.3 (2020-10-10)
    Platform: x86_64-apple-darwin17.0 (64-bit)
    Running under: macOS Catalina 10.15.7

Which is what I would expect.  I don't doubt the error you've seen, just
providing a data point for whoever ends up looking into this further.

Best,

Brodie.


Re: [External] R crashes when using huge data sets with character string variables

luke-tierney
In reply to this post by Arne Henningsen-3
If R is receiving a kill signal there is nothing it can do about it.

I am guessing you are running into a memory over-commit issue in your OS.
https://en.wikipedia.org/wiki/Memory_overcommitment
https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

If you have to run this close to your physical memory limits you might
try using your shell's facility (ulimit for bash, limit for some
others) to limit process memory/virtual memory use to your available
physical memory. You can also try setting the R_MAX_VSIZE environment
variable mentioned in ?Memory; that only affects the R heap, not
malloc() done elsewhere.
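
As a rough illustration of the distinction above (a sketch, not part of the original advice): when R itself refuses an allocation, the failure is an ordinary R error that tryCatch() can intercept, whereas a kill signal from the OS ends the process before any handler can run. The 1e15-element request is an arbitrary size chosen only because it should exceed any realistic memory limit. As far as I know, R_MAX_VSIZE has to be set before R starts (in the shell or in .Renviron), so the check below only reports what the current session inherited.

```r
# Report whether R_MAX_VSIZE was set in the environment of this session.
max_vsize <- Sys.getenv("R_MAX_VSIZE", unset = NA)
if (is.na(max_vsize)) {
  message("R_MAX_VSIZE is not set")
} else {
  message("R_MAX_VSIZE = ", max_vsize)
}

# An allocation that R refuses becomes a catchable R error.
# 1e15 doubles would need about 8e15 bytes (arbitrary, deliberately absurd size).
result <- tryCatch(
  {
    x <- numeric(1e15)
    "allocated"
  },
  error = function(e) paste("caught:", conditionMessage(e))
)
print(result)
```

No handler of this kind can help when the OS or a memory-monitoring daemon kills the process from outside.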

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


Re: [External] R crashes when using huge data sets with character string variables

Dirk Eddelbuettel

On 12 December 2020 at 21:26, [hidden email] wrote:
| If R is receiving a kill signal there is nothing it can do about it.
|
| I am guessing you are running into a memory over-commit issue in your OS.
| https://en.wikipedia.org/wiki/Memory_overcommitment
| https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/
|
| If you have to run this close to your physical memory limits you might
| try using your shell's facility (ulimit for bash, limit for some
| others) to limit process memory/virtual memory use to your available
| physical memory. You can also try setting the R_MAX_VSIZE environment
| variable mentioned in ?Memory; that only affects the R heap, not
| malloc() done elsewhere.

Similarly, as it is Linux, you could (easily) add virtual memory via a
swapfile (see 'man 8 mkswap' and 'man 8 swapon').  But even then, I expect
this to be slow -- 1e9 is a lot.

I have 32 GB of RAM and ample swap (which is rarely used, but a safety net). When I
use your code with nObs <- 1e8 it ends up using about 6 GB, which poses no
problem, but already takes about 3 1/2 minutes:

> nObs <- 1e8
> system.time(date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs, 1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" ))
   user  system elapsed
203.723   1.779 205.528
>

You may want to play with the nObs value to see exactly where it breaks on
your box.
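
One way to do that experiment safely is to start far below the problematic size and scale up. The sketch below is only an illustration: make_dates() is a hypothetical helper wrapping the code from the original post, and the sizes are arbitrary small values. The result_gib column is a lower bound for the result vector alone, assuming 64-bit R, where a character vector holds one 8-byte pointer per element. The intermediate numeric and character vectors created inside paste() come on top of that.

```r
# Hypothetical helper: build the date strings for a given number of observations.
make_dates <- function(n) {
  paste(round(runif(n, 1981, 2015)),
        round(runif(n, 1, 12)),
        round(runif(n, 1, 31)),
        sep = "-")
}

# Arbitrary, deliberately small sizes; increase step by step on your own machine.
sizes <- c(1e4, 1e5, 1e6)

res <- data.frame(
  nObs = sizes,
  elapsed = vapply(
    sizes,
    function(n) system.time(make_dates(n))[["elapsed"]],
    numeric(1)
  )
)

# Lower bound for the result alone: 8 bytes per element, expressed in GiB.
res$result_gib <- res$nObs * 8 / 2^30
print(res)

# The same lower bound for the size in the original post.
full_result_gib <- 1e9 * 8 / 2^30
print(full_result_gib)
```

By this bound, the result for nObs <- 1e9 needs roughly 7.45 GiB before any temporaries are counted.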

Dirk

--
https://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: [External] R crashes when using huge data sets with character string variables

Iñaki Ucar
In reply to this post by luke-tierney
On Sun, 13 Dec 2020 at 04:27, <[hidden email]> wrote:
>
> If R is receiving a kill signal there is nothing it can do about it.
>
> I am guessing you are running into a memory over-commit issue in your OS.
> https://en.wikipedia.org/wiki/Memory_overcommitment
> https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

Correct. And in particular, this is most probably the earlyoom [1]
service in action, which, I believe, is installed and enabled by
default in Ubuntu 20.04. It is a simple daemon that monitors memory,
and when some conditions are reached (e.g., the system is about to
start swapping), it looks for offending processes and kills them.

[1] https://github.com/rfjakob/earlyoom

Iñaki




--
Iñaki Úcar


Re: [External] R crashes when using huge data sets with character string variables

Arne Henningsen-3
Dear all

Thanks a lot for your very helpful explanations and suggestions. I
have increased the size of my computer's "swapfile" and this solved my
problem, i.e., R no longer crashes when I work with character string
variables in my large data set (probably until I work with an even
larger data set).
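
For anyone who wants to check from within R how much memory and swap a Linux machine reports, here is a minimal sketch. It assumes a Linux system that exposes /proc/meminfo, and meminfo_kb() is a hypothetical helper written for this illustration, not an existing function.

```r
# Hypothetical helper: extract a value (in kB) from lines formatted like
# those in /proc/meminfo, e.g. "SwapTotal:       2097148 kB".
meminfo_kb <- function(lines, key) {
  hit <- grep(paste0("^", key, ":"), lines, value = TRUE)
  if (length(hit) == 0) return(NA_real_)
  as.numeric(sub("^[^:]+:[[:space:]]*([0-9]+).*$", "\\1", hit[1]))
}

# Only attempt to read the file where it exists (Linux).
if (file.exists("/proc/meminfo")) {
  info <- readLines("/proc/meminfo")
  for (key in c("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")) {
    cat(sprintf("%-13s %8.1f GiB\n", key, meminfo_kb(info, key) / 2^20))
  }
}
```

Comparing SwapTotal before and after enlarging the swapfile is a quick way to confirm that the change took effect.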

Best wishes,
Arne





--
Arne Henningsen
http://www.arne-henningsen.name
