readLines() segfaults on large file & question on how to work around

readLines() segfaults on large file & question on how to work around

Jennifer Lyon
Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.
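
For reference, a minimal sketch of that workflow (the file name here is
hypothetical):

txt <- readLines("data.json")
dat <- jsonlite::fromJSON(paste(txt, collapse = "\n"))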

When I try to read this file in using readLines(), R segfaults.

I believe the two salient issues with this file are:
1) its size
2) it is a single line (no line breaks)

I can reproduce this issue as follows:
# Generate a big file with no line breaks
# In R
> writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")

# in unix shell
cp alpha.txt file.txt
for i in {1..26}; do
  cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt
done

This generates a roughly 2.4GB file (36 bytes doubled 26 times) with no
line breaks.

in R:
> moo <- readLines("file.txt")

 *** caught segfault ***
address 0x7cffffff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude:
 I am potentially running up against a limit in R, which should give a
reasonable error, but currently just segfaults.

My question:
Most of the content of the JSON is an approximately 100K x 6K JSON
equivalent of a dataframe, and I know R can handle much bigger than this
size. I am expecting these JSON files to get even larger. My R code lives
in a bigger system, and the JSON comes in via stdin, so I have absolutely
no control over the data format. I can imagine trying to incrementally
parse the JSON so I don't bump up against the limit, but I am eager for
suggestions of simpler solutions.

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but like so many things I have no
control over when bugs leap up.

Thanks.

Jen

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   R-3.4.1/lib/libRblas.so
LAPACK: R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1


Re: readLines() segfaults on large file & question on how to work around

Ista Zahn
As a workaround, I suggest readr::read_file().
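
A minimal sketch (the file name is hypothetical; read_file() returns the
whole file as a single string, which can then go to jsonlite):

library(readr)
txt <- read_file("file.json")
dat <- jsonlite::fromJSON(txt)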

--Ista



Re: readLines() segfaults on large file & question on how to work around

Jennifer Lyon
Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

Error in read_file_(ds, locale) : negative length vectors are not allowed

Jen


Re: readLines() segfaults on large file & question on how to work around

Iñaki Úcar
2017-09-02 20:58 GMT+02:00 Jennifer Lyon <[hidden email]>:

> I believe the two salient issues with this file are:
> 1) its size
> 2) it is a single line (no line breaks)

As a workaround you can pipe the file through something like
"sed -e 's/,/,\n/g'" before it reaches your R script, to insert line breaks.
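
A minimal sketch of the same idea from within R (assuming GNU sed and a
hypothetical file name; note that splitting on every comma is only safe
when no commas occur inside quoted JSON strings):

con <- pipe("sed -e 's/,/,\\n/g' file.json")
lines <- readLines(con)
close(con)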

Iñaki


Re: readLines() segfaults on large file & question on how to work around

Suzen, Mehmet
In reply to this post by Jennifer Lyon
Jennifer, why don't you try SparkR?

https://spark.apache.org/docs/1.6.1/api/R/read.json.html
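
A minimal sketch against the linked 1.6.1 API (assuming a local Spark
installation; the file name is hypothetical):

library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)
df <- read.json(sqlContext, "file.json")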


Re: readLines() segfaults on large file & question on how to work around

Jeroen Ooms
On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <[hidden email]> wrote:
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.

If your data consists of one JSON object per line, this is called
'ndjson'. There are several packages specialized for reading ndjson files:

 - corpus::read_ndjson
 - ndjson::stream_in
 - jsonlite::stream_in

In particular the 'corpus' package handles large files really well
because it has an option to memory-map the file instead of reading all
of its data into memory.
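
A minimal sketch of the streaming and memory-mapped approaches (the file
name is hypothetical; mmap is my reading of corpus's memory-map option,
so check the package documentation):

# jsonlite: parse the file in batches from a connection
df1 <- jsonlite::stream_in(file("big.ndjson"), pagesize = 10000)

# corpus: memory-map the file instead of loading it all at once
df2 <- corpus::read_ndjson("big.ndjson", mmap = TRUE)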

If the data is too large to read, you can preprocess it using
https://stedolan.github.io/jq/ to extract the fields that you need.
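
For example, a sketch that streams jq output into R (assuming jq is
installed and that the records have a hypothetical field named "data"):

con <- pipe("jq -c '.data' big.ndjson")
fields <- readLines(con)
close(con)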

You really don't need hadoop/spark/etc for this.


Re: readLines() segfaults on large file & question on how to work around

Jennifer Lyon
Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and which
exactly matches my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault: I am running Ubuntu 14.04, which is too old,
      so the package won't compile
corpus::read_ndjson - works!!! Of course it does a different simplification
     than jsonlite::fromJSON, so I have to change some code, but it works
     beautifully, at least in simple tests. The memory-map option may be of
     use in the future.

Another correspondent said that strings in R can only be 2^31-1 bytes
long, which is why any "solution" that tries to load the whole file into R
as a single string will fail.
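
A quick check of that limit against the ~2.4GB test file from the start
of the thread:

file.size("file.txt") > .Machine$integer.max  # TRUE: too big for one string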

Thanks for suggesting a path forward for me!

Jen


Re: readLines() segfaults on large file & question on how to work around

Jan van der Laan
Although the problem can apparently be avoided in this case, readLines()
causing a segfault still seems like unwanted behaviour to me. I can
replicate it with the example below (sessionInfo is further down):


# Generate an example file
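# (2500 writes of 1e6 characters = 2.5e9 characters on one line, past 2^31 - 1)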
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE),
   collapse="")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
   writeLines(l, con, sep ="")
}
close(con)


# Causes segfault:
readLines("test.txt")

The error reported by readr is also reproduced (a more informative error
message and a check for integer overflow would be nice). I will report
this to the readr maintainers.

library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed


--
Jan

 > sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2


Re: readLines() segfaults on large file & question on how to work around

Tomas Kalibera
As of R-devel r72925, one gets a proper error message instead of the crash.

Tomas

