readLines without skipNul=TRUE causes crash

readLines without skipNul=TRUE causes crash

ajdamico
hello, the last line of the code below causes a segfault for me on 3.4.1.
i think i should submit to https://bugs.r-project.org/  unless others have
advice?  thanks





install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")


file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download!  cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

archive::archive_extract( tf , dir = normalizePath( file_folder ) )

unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )

infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

# works
R.utils::countLines( infile )

# works with warning
my_file <- readLines( infile , skipNul = TRUE )

# crash
my_file <- readLines( infile )


# run just before crash
sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 15063)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base

# loaded via a namespace (and not attached):
#  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
#  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
#  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
# [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
# [17] archive_0.0.0.9000


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: readLines without skipNul=TRUE causes crash

Duncan Murdoch-2
On 15/07/2017 7:35 AM, Anthony Damico wrote:
> hello, the last line of the code below causes a segfault for me on 3.4.1.
> i think i should submit to https://bugs.r-project.org/  unless others have
> advice?  thanks

Segfaults are usually worth reporting as bugs.  Try to come up with a
self-contained example, not using the lodown and archive packages.  I
imagine you can do this by uploading the file you downloaded, or enough
of a subset of it to trigger the segfault.  If you can't do that, then
likely the bug is with one of those packages, not with R.

Duncan Murdoch
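
A minimal sketch of one way to cut such a subset out of a large file, assuming base R only. The function name `copy_byte_range` and its arguments are hypothetical, not anything used in this thread; it reads and discards the leading bytes rather than calling seek(), since ?seek discourages seek() on Windows.

```r
# hypothetical helper: copy up to n_bytes from `from` into `to`,
# after skipping the first skip_bytes bytes of `from`
copy_byte_range <- function( from , to , skip_bytes = 0 , n_bytes = Inf , chunk_size = 1e6 ) {

    in_con <- file( from , "rb" )
    on.exit( close( in_con ) )
    out_con <- file( to , "wb" )
    on.exit( close( out_con ) , add = TRUE )

    # read and discard everything before the requested range
    while ( skip_bytes > 0 ) {
        dropped <- readBin( in_con , what = "raw" , n = min( chunk_size , skip_bytes ) )
        if ( length( dropped ) == 0 ) break
        skip_bytes <- skip_bytes - length( dropped )
    }

    # copy the range itself, one chunk at a time
    copied <- 0
    while ( copied < n_bytes ) {
        chunk <- readBin( in_con , what = "raw" , n = min( chunk_size , n_bytes - copied ) )
        if ( length( chunk ) == 0 ) break
        writeBin( chunk , out_con )
        copied <- copied + length( chunk )
    }

    # number of bytes actually written
    invisible( copied )
}
```

Whether a given subset still triggers the segfault would have to be checked by trial; the sketch only handles the copying.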


Re: readLines without skipNul=TRUE causes crash

ajdamico
hi, thanks Dr. Murdoch


i'd appreciate it if anyone on r-help could help me narrow this down.  i
believe the segfault occurs because the file has a single line of nearly 4GB
that also contains embedded nuls, but i am not sure how to artificially
construct that?


the lodown package can be removed from my example.  it is just for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`.  my current example requires a huge download, so it is sort
of painful to repeat, but i'm pretty confident the download step is not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

    > file.size(infile)
    [1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

    > w <- file( infile , 'r' )
    >
    > first_80936_lines <- readLines( w , n = 80936 )
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "1000023930632009"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "36F2924009PAULO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "AFONSO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "BA11"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "00000"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "00"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "2924009PAULO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "AFONSO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "BA1111"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "467.20"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "346.10"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "414.40"
    > scan( w , n = 1 , what = character() )
    Error in scan(w, n = 1, what = character()) :
      could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
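
The guess that one line holds nearly all of the file can also be checked without reading any line into a string. A minimal sketch, assuming base R only: `scan_raw_lines` is a hypothetical helper name, and it counts bytes rather than characters.

```r
# hypothetical helper: walk a file in raw chunks, counting newline and nul
# bytes and tracking the longest line (in bytes, newline excluded)
scan_raw_lines <- function( path , chunk_size = 1e6 ) {

    con <- file( path , "rb" )
    on.exit( close( con ) )

    # doubles rather than integers, so counts past 2^31 do not overflow
    n_newlines <- 0 ; n_nuls <- 0
    longest <- 0 ; current <- 0

    repeat {
        chunk <- readBin( con , what = "raw" , n = chunk_size )
        if ( length( chunk ) == 0 ) break

        n_nuls <- n_nuls + sum( chunk == as.raw( 0 ) )

        nl <- which( chunk == as.raw( 10 ) )
        n_newlines <- n_newlines + length( nl )

        if ( length( nl ) == 0 ) {
            # no newline in this chunk: the current line just keeps growing
            current <- current + length( chunk )
        } else {
            # first length closes the line carried over from earlier chunks,
            # the remaining lengths are lines fully inside this chunk
            line_lengths <- c( current + nl[ 1 ] - 1 , diff( nl ) - 1 )
            longest <- max( longest , line_lengths )
            current <- length( chunk ) - nl[ length( nl ) ]
        }
    }

    list( newlines = n_newlines , nuls = n_nuls , longest_line = max( longest , current ) )
}

# usage on the file from this thread would be:
# scan_raw_lines( infile )
```

A `longest_line` close to the file size together with a non-zero `nuls` count would support the single-huge-line-with-nuls explanation.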



making a huge single-line file does not reproduce the problem, so i think
the embedded nuls have something to do with it:


    # WARNING do not run with less than 64GB RAM
    tf <- tempfile()
    a <- rep( "a" , 1000000000 )
    b <- paste( a , collapse = '' )
    writeLines( b , tf ) ; rm( b ) ; gc()
    d <- readLines( tf )
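
For the embedded-nul half of the question, a small-scale sketch of building such a file with writeBin() on a binary connection. It is a few bytes long, so it shows how readLines() treats nuls but is not expected to reproduce the segfault; growing the second line to gigabytes (for example by writing the same raw chunk in a loop) is an untested assumption.

```r
# build a two-line text file whose second line contains embedded nul bytes
tf <- tempfile( fileext = ".txt" )
con <- file( tf , "wb" )
writeBin( charToRaw( "line one\n" ) , con )
writeBin(
    c( charToRaw( "abc" ) , as.raw( 0 ) , charToRaw( "def" ) , as.raw( 0 ) , charToRaw( "\n" ) ) ,
    con
)
close( con )

# default: readLines() warns about the embedded nul and the line is cut at the first nul
without_skip <- suppressWarnings( readLines( tf ) )

# skipNul = TRUE: the nul bytes are dropped and the rest of the line is kept
with_skip <- readLines( tf , skipNul = TRUE )
```

writeBin() is used because writeLines() works on character strings, and R character strings cannot hold a nul.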




Re: readLines without skipNul=TRUE causes crash

ajdamico
hi, i realized that the segfault happens when reading the text file in a
new R session.  so, creating the segfault-generating text file requires a
contributed package, but triggering the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so i
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!



Re: readLines without skipNul=TRUE causes crash

Duncan Murdoch-2
On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!

Hopefully someone can debug it with the info you provided.

Duncan Murdoch


Re: readLines without skipNul=TRUE causes crash

Jeff Newmiller
In reply to this post by ajdamico
I am not able to reproduce this on a Linux platform:

#######################
fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
##                                                "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721

No segfault occurs.
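
The checksum above gives a quick way to tell whether a file extracted on another machine is the same file. A minimal sketch, assuming base R plus the bundled tools package; `md5_matches` is a hypothetical helper name.

```r
# hypothetical helper: TRUE when the file at `path` has the expected md5 checksum
md5_matches <- function( path , expected_md5 ) {
    # tools::md5sum() returns a named vector, so names are dropped before comparing
    identical( unname( tools::md5sum( path ) ) , tolower( unname( expected_md5 ) ) )
}

# usage with the checksum posted above would be:
# md5_matches( infile , "83e61c96092285b60d7bf6b0dbc7072e" )
```

A mismatch would suggest the two extractions produced different files, which is one possible reason for the different line counts reported in this thread.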

On Sat, 15 Jul 2017, Anthony Damico wrote:

> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not -- pretty
> sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
> not doing something remarkably stupid.  the text file itself is 4GB so
> cannot upload it to bugzilla, and from the R_AllocStringBugger error in the
> previous message, i think most or all of it needs to be there to trigger
> the segfault.  thanks!
>
>
> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <[hidden email]> wrote:
>
>> hi, thanks Dr. Murdoch
>>
>>
>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>> believe the segfault occurs because there's a single line with 4GB and also
>> embedded nuls, but i am not sure how to artificially construct that?
>>
>>
>> the lodown package can be removed from my example..  it is just for file
>> download cacheing, so `lodown::cachaca` can be replaced with
>> `download.file`  my current example requires a huge download, so sort of
>> painful to repeat but i'm pretty confident that's not the issue.
>>
>>
>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>
>>    > file.size(infile)
>>     [1] 4078192743 <(407)%20819-2743>
>>
>>
>> i am pretty sure that nearly all of that 4GB is contained on a single line
>> in the file.  here's what happens when i create a file connection and scan
>> through..
>>
>>    > file_con <- file( infile , 'r' )
>>    >
>>    > first_80936_lines <- readLines( file_con , n = 80936 )
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "1000023930632009"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "36F2924009PAULO"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "AFONSO"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "BA11"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "00000"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "00"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "2924009PAULO"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "AFONSO"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "BA1111"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "467.20"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "346.10"
>>    > scan( w , n = 1 , what = character() )
>>     Read 1 item
>>     [1] "414.40"
>>    > scan( w , n = 1 , what = character() )
>>     Error in scan(w, n = 1, what = character()) :
>>       could not allocate memory (2048 Mb) in C function
>> 'R_AllocStringBuffer'
>>
>>
>>
>> making a huge single-line file does not reproduce the problem, i think the
>> embedded nuls have something to do with it--
>>
>>
>>     # WARNING do not run with less than 64GB RAM
>>     tf <- tempfile()
>>     a <- rep( "a" , 1000000000 )
>>     b <- paste( a , collapse = '' )
>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>     d <- readLines( tf )
>>
>>
>>
>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <[hidden email]>
>> wrote:
>>
>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>
>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>> i think i should submit to https://bugs.r-project.org/  unless others
>>>> have
>>>> advice?  thanks
>>>>
>>>
>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>> self-contained example, not using the lodown and archive packages.  I
>>> imagine you can do this by uploading the file you downloaded, or enough of
>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>>> the bug is with one of those packages, not with R.
>>>
>>> Duncan Murdoch
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>> install.packages( "devtools" )
>>>> devtools::install_github("ajdamico/lodown")
>>>> devtools::install_github("jimhester/archive")
>>>>
>>>>
>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>
>>>> tf <- tempfile()
>>>>
>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>> lodown::cachaca( '
>>>> http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf ,
>>>> mode
>>>> = 'wb' )
>>>>
>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>
>>>> unzipped_files <- list.files( file_folder , recursive = TRUE ,
>>>> full.names =
>>>> TRUE  )
>>>>
>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>
>>>> # works
>>>> R.utils::countLines( infile )
>>>>
>>>> # works with warning
>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>
>>>> # crash
>>>> my_file <- readLines( infile )
>>>>
>>>>
>>>> # run just before crash
>>>> sessionInfo()
>>>> # R version 3.4.1 (2017-06-30)
>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> # Running under: Windows 10 x64 (build 15063)
>>>>
>>>> # Matrix products: default
>>>>
>>>> # locale:
>>>> # [1] LC_COLLATE=English_United States.1252
>>>> # [2] LC_CTYPE=English_United States.1252
>>>> # [3] LC_MONETARY=English_United States.1252
>>>> # [4] LC_NUMERIC=C
>>>> # [5] LC_TIME=English_United States.1252
>>>>
>>>> # attached base packages:
>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> # loaded via a namespace (and not attached):
>>>>  # [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1
>>>>  withr_1.0.2
>>>>  # [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11
>>>> memoise_1.1.0
>>>>  # [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12
>>>> lodown_0.1.0
>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2
>>>> R.oo_1.21.0
>>>> # [17] archive_0.0.0.9000
>>>>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: readLines without skipNul=TRUE causes crash

Duncan Murdoch-2
In reply to this post by ajdamico
On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!

I don't want to download the big file or install the archive package.
Could you run the code below on the bad file?  If you're right and it's
only nulls that matter, this might allow me to create a file that
triggers the bug.

f <-  # put the filename of the bad file here

con <- file(f, open="rb")
zeros <- numeric()
count <- 0  # bytes read so far; must be initialized before the loop uses it
repeat {
   bytes <- readBin(con, "int", 1000000, size=1)
   zeros <- c(zeros, count + which(bytes == 0))
   count <- count + length(bytes)
   if (length(bytes) < 1000000) break
}
close(con)
cat("File length=", count, "\n")
cat("Nulls:\n")
zeros
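
A variant of the scan above, as a minimal sketch rather than a tested replacement: it reads raw bytes instead of integers and collects the per-chunk offsets in a list, combining them once at the end. The names `find_nuls` and `chunk_size` are made up for illustration, and the demonstration file is a tiny stand-in for the real one. If the file holds a very large number of nuls, either version will still need a lot of memory for the offsets.

```r
# Sketch: locate embedded nul bytes in a file, reading fixed-size chunks.
# `find_nuls` and `chunk_size` are illustrative names, not from the thread.
find_nuls <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offsets <- list()   # one vector of 1-based positions per chunk
  count <- 0          # bytes consumed so far (a double, so fine past 2^31)
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break
    offsets[[length(offsets) + 1L]] <- count + which(bytes == as.raw(0))
    count <- count + length(bytes)
  }
  list(size = count, nuls = as.numeric(unlist(offsets)))
}

# Tiny demonstration file with the bytes: a b <nul> c d <nul> <newline>
demo <- tempfile()
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x00, 0x0a)), demo)
res <- find_nuls(demo, chunk_size = 4)
res$size   # total bytes read
res$nuls   # 1-based offsets of the nul bytes
```

The returned `size` and `nuls` play the same roles as `count` and `zeros` in the code above.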

Here's some code to recreate a file of the same length with nulls in the
same places, and spaces everywhere else:

size <- count
f2 <- tempfile()
con <- file(f2, open="wb")
count <- 0
while (count < size) {
   nonzeros <- min(c(size - count, 1000000, zeros - 1))
   if (nonzeros) {
     writeBin(rep(32L, nonzeros), con, size = 1)
     count <- count + nonzeros
   }
   zeros <- zeros - nonzeros
   if (length(zeros) && min(zeros) == 1) {
     writeBin(0L, con, size = 1)
     count <- count + 1
     zeros <- zeros[-1] - 1
   }
}
close(con)
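
To try the idea at a small scale first, here is a hedged sketch that builds a 20-byte file of spaces with nuls at two made-up offsets and reads it back both ways. The size, the offsets and the name `f_small` are assumptions for illustration only. A file this small is not expected to crash anything; it only shows how `skipNul` changes what `readLines` returns.

```r
# Sketch: a small file of spaces with nuls at chosen 1-based offsets.
# The size and offsets are invented for illustration.
size  <- 20
zeros <- c(5, 12)

bytes <- rep(as.raw(0x20), size)   # spaces everywhere ...
bytes[zeros] <- as.raw(0x00)       # ... except nuls at the chosen offsets
bytes[size]  <- as.raw(0x0a)       # final newline, so the file is one complete line

f_small <- tempfile()
writeBin(bytes, f_small)

# With skipNul = TRUE the nuls are dropped and the rest of the line is kept.
kept <- readLines(f_small, skipNul = TRUE)

# Without it, readLines warns about an embedded nul; the warning is
# suppressed here so that the two results can be compared quietly.
default <- suppressWarnings(readLines(f_small))

nchar(kept)
nchar(default)
```

Comparing the two lengths shows how much of the line is lost when the nuls are not skipped.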

Duncan Murdoch

Re: readLines without skipNul=TRUE causes crash

Jeff Newmiller
In reply to this post by Jeff Newmiller
I am not able to reproduce your segfault on a Windows 7 platform either:

##########################
fn1 <- "d:/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
##             d:/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721


On Sat, 15 Jul 2017, Jeff Newmiller wrote:

> I am not able to reproduce this on a Linux platform:
>
> #######################3
> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-pc-linux-gnu (64-bit)
> ## Running under: Ubuntu 14.04.5 LTS
> ##
> ## Matrix products: default
> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> ##
> ## locale:
> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> ##                                                "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
> No segfault occurs.
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

Re: readLines without skipNul=TRUE causes crash

R help mailing list-2
I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
debugging but gdb gives some information when I attach the debugger after
the 'R..has stopped working' popup appears.  I don't know how reliable it
is:

(gdb) info threads
  Id   Target Id         Frame
* 4    Thread 11848.0x1500 0x00007ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  3    Thread 11848.0x2e90 0x00007ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  2    Thread 11848.0x3618 0x00007ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  1    Thread 11848.0x1808 0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) thread 1
[Switching to thread 1 (Thread 11848.0x1808)]
#0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) where
#0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#1  0x000000006c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#2  0x000000006c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#3  0x000000006c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#4  0x000000006c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#5  0x000000006c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#6  0x000000006c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#7  0x000000006c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#8  0x000000006c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#9  0x000000006c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#10 0x000000006c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#11 0x000000000040171c in ?? ()
#12 0x000000000040155a in ?? ()
#13 0x00000000004013e8 in ?? ()
#14 0x000000000040151b in ?? ()
#15 0x00007ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
#16 0x00007ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#17 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller <[hidden email]>
wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##########################
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ##             d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
> I am not able to reproduce this on a Linux platform:
>>
>> #######################3
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>> 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>> 2009/DADOS_ENEM_2009.txt
>> ##
>> "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>

Re: readLines without skipNul=TRUE causes crash

ajdamico
In reply to this post by Duncan Murdoch-2
thank you for taking the time to write this.  i set it running last night
and it's still going -- if it doesn't finish by tomorrow, i will try to
find a site to host the problem file and add that link to the bug report so
the archive package can be avoided at least.  i'm sorry for the bother


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: readLines without skipNul=TRUE causes crash

ajdamico
In reply to this post by Jeff Newmiller
hi, thank you for attempting this. it looks like your unix machine unzipped
the txt file without corruption -- if you copied over the same txt file to
windows 7, i don't think that would reproduce the problem?  i think it
needs to be the corrupted text file where   R.utils::countLines( txtfile
)   gives 809367.  i am able to reproduce on two distinct windows machines
but no guarantee i'm not doing something dumb
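
For readers without the 4GB file, the documented effect of an embedded nul on readLines can be seen on a tiny file. This is only a small-scale sketch of the skipNul behaviour described in ?readLines. It does not reproduce the segfault, and the file contents are made up for illustration.

```r
# Small throwaway file: one line with an embedded nul, one ordinary line
tmp <- tempfile()
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("cd\nef\n")), tmp)

# Default: the nul cuts the line short and readLines() warns about it
default_read <- suppressWarnings(readLines(tmp))

# skipNul = TRUE: the nul byte is dropped and the rest of the line is kept
skipped_read <- readLines(tmp, skipNul = TRUE)

print(default_read)
print(skipped_read)
```

With the default, the first line comes back cut off at the nul; with skipNul = TRUE the nul is skipped and the line is kept whole.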

On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <[hidden email]>
wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##########################
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ##             d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
> I am not able to reproduce this on a Linux platform:
>>
>> #######################3
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>
>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>
>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not -- pretty
>>> sure that means this is a base R bug?  submitted here:
>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
>>> not doing something remarkably stupid.  the text file itself is 4GB so
>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>>> previous message, i think most or all of it needs to be there to trigger
>>> the segfault.  thanks!
>>>
>>>
>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <[hidden email]>
>>> wrote:
>>>
>>> hi, thanks Dr. Murdoch
>>>>
>>>>
>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>> believe the segfault occurs because there's a single line with 4GB and also
>>>> embedded nuls, but i am not sure how to artificially construct that?
>>>>
>>>>
>>>> the lodown package can be removed from my example..  it is just for file
>>>> download caching, so `lodown::cachaca` can be replaced with
>>>> `download.file`.  my current example requires a huge download, so sort of
>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>
>>>>
>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>
>>>>    > file.size(infile)
>>>>     [1] 4078192743
>>>>
>>>>
>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>>>> in the file.  here's what happens when i create a file connection and scan
>>>> through..
>>>>
>>>>    > w <- file( infile , 'r' )
>>>>    >
>>>>    > first_80936_lines <- readLines( w , n = 80936 )
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "1000023930632009"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "36F2924009PAULO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "AFONSO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "BA11"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "00000"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "00"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "2924009PAULO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "AFONSO"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "BA1111"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "467.20"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "346.10"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Read 1 item
>>>>     [1] "414.40"
>>>>    > scan( w , n = 1 , what = character() )
>>>>     Error in scan(w, n = 1, what = character()) :
>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>
>>>>
>>>>
>>>> making a huge single-line file does not reproduce the problem, i think the
>>>> embedded nuls have something to do with it--
>>>>
>>>>
>>>>     # WARNING do not run with less than 64GB RAM
>>>>     tf <- tempfile()
>>>>     a <- rep( "a" , 1000000000 )
>>>>     b <- paste( a , collapse = '' )
>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>     d <- readLines( tf )
>>>>
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <[hidden email]> wrote:
>>>>
>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>
>>>>> hello, the last line of the code below causes a segfault for me on
>>>>>> 3.4.1.
>>>>>> i think i should submit to https://bugs.r-project.org/  unless others
>>>>>> have
>>>>>> advice?  thanks
>>>>>>
>>>>>>
>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>>>>> the bug is with one of those packages, not with R.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> install.packages( "devtools" )
>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>> devtools::install_github("jimhester/archive")
>>>>>>
>>>>>>
>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>
>>>>>> tf <- tempfile()
>>>>>>
>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>
>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>
>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>
>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>
>>>>>> # works
>>>>>> R.utils::countLines( infile )
>>>>>>
>>>>>> # works with warning
>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>
>>>>>> # crash
>>>>>> my_file <- readLines( infile )
>>>>>>
>>>>>>
>>>>>> # run just before crash
>>>>>> sessionInfo()
>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>
>>>>>> # Matrix products: default
>>>>>>
>>>>>> # locale:
>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>> # [4] LC_NUMERIC=C
>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>
>>>>>> # attached base packages:
>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> # loaded via a namespace (and not attached):
>>>>>>  # [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>  # [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>  # [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>> # [17] archive_0.0.0.9000
>>>>>>
>>>>>
>>>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
>


Re: readLines without skipNul=TRUE causes crash

ajdamico
sorry, typo, 80937 not 809367


Re: readLines without skipNul=TRUE causes crash

Duncan Murdoch-2
In reply to this post by ajdamico
On 16/07/2017 6:17 AM, Anthony Damico wrote:
> thank you for taking the time to write this.  i set it running last
> night and it's still going -- if it doesn't finish by tomorrow, i will
> try to find a site to host the problem file and add that link to the bug
> report so the archive package can be avoided at least.  i'm sorry for
> the bother
>

How big is that text file?  I wouldn't expect my script to take more
than a few minutes even on a huge file.

My script might have a bug...

Duncan Murdoch
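
One possible reason for the long run time, offered as a guess rather than a diagnosis: `zeros <- c(zeros, ...)` copies the whole vector on every chunk, which gets slow when a file holds a great many nuls. Below is a minimal sketch of the same scan that reads raw bytes and collects the per-chunk results in a list, combining them once at the end. The names `find_nul_offsets` and `chunk_size` are made up here and are not part of the script quoted in this thread.

```r
# Hypothetical helper: file length plus the 1-based byte offsets of all nul bytes
find_nul_offsets <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  pieces <- list()  # offsets found in each chunk, combined once at the end
  count <- 0        # bytes read so far (a double, so it can pass 2^31)
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    hits <- which(bytes == as.raw(0))
    if (length(hits)) pieces[[length(pieces) + 1L]] <- count + hits
    count <- count + length(bytes)
    if (length(bytes) < chunk_size) break
  }
  list(size = count, nuls = as.numeric(unlist(pieces)))
}

# Self-check on a throwaway 7-byte file with nuls at bytes 3 and 6
tmp <- tempfile()
writeBin(as.raw(c(97, 98, 0, 99, 100, 0, 10)), tmp)
res <- find_nul_offsets(tmp, chunk_size = 4)
print(res)
```

If a large share of the 4GB file turns out to be nuls, even this version would need a lot of memory to hold the offsets, so counting them per chunk may be the more practical first step.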


Re: readLines without skipNul=TRUE causes crash

ajdamico
hi, the text file that prompts the segfault is 4gb but only 80,937 lines

> file.info( "S:/temp/crash.txt")
                        size isdir mode               mtime               ctime               atime exe
S:/temp/crash.txt 4078192743 FALSE  666 2017-07-15 17:24:35 2017-07-15 17:19:47 2017-07-15 17:19:47  no
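
A file of this size with so few lines suggests that one line holds most of the bytes. A rough way to check that without calling readLines on the whole file is to scan it in binary chunks and measure the gaps between newline bytes. The sketch below assumes "\n" line endings (a "\r" before the newline is counted as part of the line), and `longest_line_bytes` is a made-up name, not a base R function.

```r
# Hypothetical helper: length in bytes of the longest line, found by scanning raw chunks
longest_line_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  longest <- 0
  current <- 0  # bytes seen since the last newline
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    nl <- which(bytes == as.raw(10))  # positions of "\n" within this chunk
    if (length(nl) == 0) {
      current <- current + length(bytes)
    } else {
      lens <- diff(c(0, nl)) - 1       # lengths of lines ending in this chunk
      lens[1] <- lens[1] + current     # first one also covers the carried-over bytes
      longest <- max(longest, lens)
      current <- length(bytes) - nl[length(nl)]
    }
  }
  max(longest, current)  # the last line may have no trailing newline
}

# Self-check on a throwaway file holding "ab", "cdefg" and an unterminated "h"
tmp <- tempfile()
writeBin(charToRaw("ab\ncdefg\nh"), tmp)
longest <- longest_line_bytes(tmp, chunk_size = 4)
print(longest)
```

The small chunk size in the self-check is only there to exercise lines that span chunk boundaries; on the real file the default of one million bytes per read would be the sensible choice.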





Re: readLines without skipNul=TRUE causes crash

Jeff Newmiller
In reply to this post by ajdamico
So you are saying there are two problems... one that produces a corrupt file from a valid compressed file, and one that segfaults when presented with that corrupt file? Can you please confirm the file name and run md5sum on it and share the result so we can tell when the file problem has been reproduced?
--
Sent from my phone. Please excuse my brevity.
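
A short sketch of how the requested details could be collected in one call, so that the corrupted and the intact copies can be told apart. It uses file.size() and tools::md5sum() from the standard R distribution; the helper name `file_fingerprint` is made up for illustration.

```r
# Hypothetical helper: name, size and MD5 checksum of one particular copy of a file
file_fingerprint <- function(path) {
  data.frame(
    file = basename(path),
    bytes = file.size(path),
    md5 = unname(tools::md5sum(path)),
    stringsAsFactors = FALSE
  )
}

# Self-check on a throwaway file holding the three bytes "abc"
tmp <- tempfile()
writeBin(charToRaw("abc"), tmp)
fp <- file_fingerprint(tmp)
print(fp)
```

Two copies that differ in either the byte count or the checksum are not the same file, which would settle whether the extraction step or readLines is being tested.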

On July 16, 2017 3:21:21 AM PDT, Anthony Damico <[hidden email]> wrote:

>hi, thank you for attempting this. it looks like your unix machine unzipped
>the txt file without corruption -- if you copied over the same txt file to
>windows 7, i don't think that would reproduce the problem?  i think it
>needs to be the corrupted text file where   R.utils::countLines( txtfile )
>gives 809367.  i am able to reproduce on two distinct windows machines
>but no guarantee i'm not doing something dumb
>
>On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <[hidden email]> wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>
>> ##########################
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ##             d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>>> I am not able to reproduce this on a Linux platform:
>>>
>>> ##########################
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>>> hi, i realized that the segfault happens on the text file in a new R
>>>> session.  so, creating the segfault-generating text file requires a
>>>> contributed package, but prompting the actual segfault does not -- pretty
>>>> sure that means this is a base R bug?  submitted here:
>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
>>>> not doing something remarkably stupid.  the text file itself is 4GB so
>>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>>>> previous message, i think most or all of it needs to be there to trigger
>>>> the segfault.  thanks!
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <[hidden email]> wrote:
>>>>
>>>>> hi, thanks Dr. Murdoch
>>>>>
>>>>>
>>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>>> believe the segfault occurs because there's a single line with 4GB and also
>>>>> embedded nuls, but i am not sure how to artificially construct that?
>>>>>
>>>>>
>>>>> the lodown package can be removed from my example..  it is just for file
>>>>> download caching, so `lodown::cachaca` can be replaced with
>>>>> `download.file`.  my current example requires a huge download, so sort of
>>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>>
>>>>>
>>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>>
>>>>>    > file.size(infile)
>>>>>     [1] 4078192743
>>>>>
>>>>>
>>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>>>>> in the file.  here's what happens when i create a file connection and scan
>>>>> through..
>>>>>
>>>>>    > w <- file( infile , 'r' )
>>>>>    >
>>>>>    > first_80936_lines <- readLines( w , n = 80936 )
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "1000023930632009"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "36F2924009PAULO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "AFONSO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "BA11"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "00000"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "00"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "2924009PAULO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "AFONSO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "BA1111"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "467.20"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "346.10"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "414.40"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Error in scan(w, n = 1, what = character()) :
>>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>
>>>>>
>>>>>
>>>>> making a huge single-line file does not reproduce the problem, i think the
>>>>> embedded nuls have something to do with it--
>>>>>
>>>>>
>>>>>     # WARNING do not run with less than 64GB RAM
>>>>>     tf <- tempfile()
>>>>>     a <- rep( "a" , 1000000000 )
>>>>>     b <- paste( a , collapse = '' )
>>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>>     d <- readLines( tf )
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <[hidden email]> wrote:
>>>>>
>>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>
>>>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>>>>> i think i should submit to https://bugs.r-project.org/  unless others have
>>>>>>> advice?  thanks
>>>>>>>
>>>>>>>
>>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
>>>>>> the bug is with one of those packages, not with R.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>


Re: readLines without skipNul=TRUE causes crash

ajdamico
hi, yep, there are two problems -- but i think only the segfault is within
the scope of a base R issue?  i need to look closer at the corrupted
decompression and figure out whether i should talk to the brazilian
government agency that creates that .rar file or open an issue with the
archive package maintainer.  my goal in this thread is only to figure out
how to replicate the goofy text file so the r team can turn it into an
error instead of a segfault.

the original example i sent stores the .txt file somewhere inside the
tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
gives the same result.  thanks again for looking at this

    > tools::md5sum(infile)
    C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
    "30beb57419486108e98d42ec7a2f8b19"


    > tools::md5sum( "S:/temp/crash.txt" )
                     S:/temp/crash.txt
    "30beb57419486108e98d42ec7a2f8b19"
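
As a starting point for replicating the "goofy text file", here is a small sketch of how `readLines()` treats an embedded nul with and without `skipNul = TRUE`. The six-byte demo file is an assumption made for illustration. At this size R only warns, so the sketch shows the documented behaviour and does not reproduce the segfault seen on the 4GB file.

```r
# Build a tiny file whose single line contains an embedded NUL byte:
# the bytes are "ab", 0x00, "cd", newline.
demo_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a)), demo_file)

# skipNul = TRUE drops the NUL and keeps the rest of the line.
with_skip <- readLines(demo_file, skipNul = TRUE)

# Without skipNul, R warns about the embedded nul. The warning is captured
# here (rather than printed) so it can be inspected afterwards.
nul_warning <- NULL
without_skip <- withCallingHandlers(
  readLines(demo_file),
  warning = function(w) {
    nul_warning <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

# Compare what each call kept of the line.
nchar(with_skip)
nchar(without_skip)
```

Scaling the same byte pattern up to a multi-gigabyte single line would be the next step towards a reproducible crash, but whether that alone triggers it is not established in this thread.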






Re: readLines without skipNul=TRUE causes crash

Jeff Newmiller
I am stuck. The archive package won't compile for me on Ubuntu, and the CRANextra repo seems to be down so I cannot install packages on Windows right now. Perhaps you can zip the corrupt text file and put it online somewhere? Don't use the archive package to pack it since there seem to be issues with that tool on your machine.

I would discourage you from harassing the Brazilian government about their RAR file because the RAR file seems fine (no NUL characters appear in the text file) when extracted using the file-roller archive tool on Ubuntu.
--
Sent from my phone. Please excuse my brevity.
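
One way to check whether an extracted file contains NUL characters, sketched under the assumption that reading it in raw chunks is acceptable. The helper name `count_nul_bytes` and the chunk size are illustrative choices, not something used elsewhere in this thread.

```r
# Hypothetical helper: count NUL (0x00) bytes in a file, reading in chunks
# so that a multi-gigabyte file never has to fit in memory at once.
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  total <- 0  # double, so very large counts do not overflow an integer
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break  # end of file
    total <- total + sum(chunk == as.raw(0L))
  }
  total
}

# Demonstration on a small file with three NUL bytes among six bytes.
demo_file <- tempfile()
writeBin(as.raw(c(0x41, 0x00, 0x42, 0x00, 0x00, 0x0a)), demo_file)
nul_count <- count_nul_bytes(demo_file, chunk_size = 4)
```

A count of zero on one extraction and a non-zero count on another would support the idea that the extraction tool, not the RAR file, introduces the NULs.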

>> >>>>>     Read 1 item
>> >>>>>     [1] "00000"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "00"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "2924009PAULO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "AFONSO"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "BA1111"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "467.20"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "346.10"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Read 1 item
>> >>>>>     [1] "414.40"
>> >>>>>    > scan( w , n = 1 , what = character() )
>> >>>>>     Error in scan(w, n = 1, what = character()) :
>> >>>>>       could not allocate memory (2048 Mb) in C function
>> >>>>> 'R_AllocStringBuffer'
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> making a huge single-line file does not reproduce the problem,
>i
>> >think
>> >>>>> the
>> >>>>> embedded nuls have something to do with it--
>> >>>>>
>> >>>>>
>> >>>>>     # WARNING do not run with less than 64GB RAM
>> >>>>>     tf <- tempfile()
>> >>>>>     a <- rep( "a" , 1000000000 )
>> >>>>>     b <- paste( a , collapse = '' )
>> >>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>> >>>>>     d <- readLines( tf )
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <
>> >>>>> [hidden email]>
>> >>>>> wrote:
>> >>>>>
>> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>> >>>>>>
>> >>>>>> hello, the last line of the code below causes a segfault for
>me
>> >on
>> >>>>>>> 3.4.1.
>> >>>>>>> i think i should submit to https://bugs.r-project.org/ 
>unless
>> >others
>> >>>>>>> have
>> >>>>>>> advice?  thanks
>> >>>>>>>
>> >>>>>>>
>> >>>>>> Segfaults are usually worth reporting as bugs.  Try to come up
>> >with a
>> >>>>>> self-contained example, not using the lodown and archive
>> >packages.  I
>> >>>>>> imagine you can do this by uploading the file you downloaded,
>or
>> >>>>>> enough of
>> >>>>>> a subset of it to trigger the segfault.  If you can't do that,
>> >then
>> >>>>>> likely
>> >>>>>> the bug is with one of those packages, not with R.
>> >>>>>>
>> >>>>>> Duncan Murdoch
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> install.packages( "devtools" )
>> >>>>>>> devtools::install_github("ajdamico/lodown")
>> >>>>>>> devtools::install_github("jimhester/archive")
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>> >>>>>>>
>> >>>>>>> tf <- tempfile()
>> >>>>>>>
>> >>>>>>> # large download!  cachaca saves on your local disk if
>already
>> >>>>>>> downloaded
>> >>>>>>> lodown::cachaca( '
>> >>>>>>>
>http://download.inep.gov.br/microdados/microdados_enem2009.rar'
>> >, tf
>> >>>>>>> ,
>> >>>>>>> mode
>> >>>>>>> = 'wb' )
>> >>>>>>>
>> >>>>>>> archive::archive_extract( tf , dir = normalizePath(
>file_folder
>> >) )
>> >>>>>>>
>> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE
>,
>> >>>>>>> full.names =
>> >>>>>>> TRUE  )
>> >>>>>>>
>> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value =
>> >TRUE )
>> >>>>>>>
>> >>>>>>> # works
>> >>>>>>> R.utils::countLines( infile )
>> >>>>>>>
>> >>>>>>> # works with warning
>> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>> >>>>>>>
>> >>>>>>> # crash
>> >>>>>>> my_file <- readLines( infile )
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> # run just before crash
>> >>>>>>> sessionInfo()
>> >>>>>>> # R version 3.4.1 (2017-06-30)
>> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >>>>>>> # Running under: Windows 10 x64 (build 15063)
>> >>>>>>>
>> >>>>>>> # Matrix products: default
>> >>>>>>>
>> >>>>>>> # locale:
>> >>>>>>> # [1] LC_COLLATE=English_United States.1252
>> >>>>>>> # [2] LC_CTYPE=English_United States.1252
>> >>>>>>> # [3] LC_MONETARY=English_United States.1252
>> >>>>>>> # [4] LC_NUMERIC=C
>> >>>>>>> # [5] LC_TIME=English_United States.1252
>> >>>>>>>
>> >>>>>>> # attached base packages:
>> >>>>>>> # [1] stats     graphics  grDevices utils     datasets
>methods
>> > base
>> >>>>>>>
>> >>>>>>> # loaded via a namespace (and not attached):
>> >>>>>>>  # [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1
>> >>>>>>>  withr_1.0.2
>> >>>>>>>  # [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11
>> >>>>>>> memoise_1.1.0
>> >>>>>>>  # [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12
>> >>>>>>> lodown_0.1.0
>> >>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2
>> >>>>>>> R.oo_1.21.0
>> >>>>>>> # [17] archive_0.0.0.9000
>> >>>>>>>
>> >>>>>>>         [[alternative HTML version deleted]]
>> >>>>>>>
>> >>>>>>> ______________________________________________
>> >>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more,
>> >see
>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>>>> PLEASE do read the posting guide
>http://www.R-project.org/posti
>> >>>>>>> ng-guide.html
>> >>>>>>> and provide commented, minimal, self-contained, reproducible
>> >code.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>         [[alternative HTML version deleted]]
>> >>>>
>> >>>> ______________________________________________
>> >>>> [hidden email] mailing list -- To UNSUBSCRIBE and more,
>see
>> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>> PLEASE do read the posting guide http://www.R-project.org/posti
>> >>>> ng-guide.html
>> >>>> and provide commented, minimal, self-contained, reproducible
>code.
>> >>>>
>> >>>>
>> >>> ------------------------------------------------------------
>> >>> ---------------
>> >>> Jeff Newmiller                        The     .....       .....
>Go
>> >>> Live...
>> >>> DCN:<[hidden email]>        Basics: ##.#.       ##.#.
>> >Live
>> >>> Go...
>> >>>                                      Live:   OO#.. Dead: OO#..
>> >Playing
>> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.
>> >with
>> >>> /Software/Embedded Controllers)               .OO#.       .OO#.
>> >>> rocks...1k
>> >>>
>> >>> ______________________________________________
>> >>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> PLEASE do read the posting guide http://www.R-project.org/posti
>> >>> ng-guide.html
>> >>> and provide commented, minimal, self-contained, reproducible
>code.
>> >>>
>> >>>
>> >> ------------------------------------------------------------
>> >> ---------------
>> >> Jeff Newmiller                        The     .....       .....
>Go
>> >Live...
>> >> DCN:<[hidden email]>        Basics: ##.#.       ##.#.
>Live
>> >> Go...
>> >>                                       Live:   OO#.. Dead: OO#..
>> >Playing
>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.
>with
>> >> /Software/Embedded Controllers)               .OO#.       .OO#.
>> >rocks...1k
>> >> ------------------------------------------------------------
>> >> ---------------
>> >>
>>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: readLines without skipNul=TRUE causes crash

ajdamico
hi, thanks again for taking the time.  since corrupted compression prompted
the segfault for me in the first place, i've just posted the text file
as-is.  it's a 2.4GB file so to be avoided on a metered internet
connection.  i've updated the bugzilla report at
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
relevant info.  these lines of code crash both windows R 3.4.1 and also
linux R 3.3.3 for me.  thanks again


    # consider changing `tempfile()` to a permanent location
    # so you don't lose the large downloaded file after the crash
    tf <- tempfile()
    download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
    sessionInfo()
    x <- readLines( tf )
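
For readers who do not want to download the large file, here is a minimal sketch of how a line with an embedded nul can be built and read on a small scale. It is not expected to reproduce the segfault, which seems to need the very long line as well; the file contents and the names `tf_small`, `x_default` and `x_skip` are made up for illustration.

```r
# hypothetical stand-in for the large file: three short lines,
# the second of which contains one embedded nul byte
tf_small <- tempfile()
con <- file( tf_small , "wb" )
writeBin(
    c(
        charToRaw( "first line\n" ) ,
        charToRaw( "ab" ) , as.raw( 0 ) , charToRaw( "cd\n" ) ,
        charToRaw( "last line\n" )
    ) ,
    con
)
close( con )

# count nul bytes by reading the whole file as raw
# (fine for a small file, not for a multi-GB one)
bytes <- readBin( tf_small , what = "raw" , n = file.size( tf_small ) )
n_nul <- sum( bytes == as.raw( 0 ) )

# default behaviour: per ?readLines an embedded nul terminates the line
# being read, with a warning (suppressed here)
x_default <- suppressWarnings( readLines( tf_small ) )

# skipNul = TRUE drops the nul bytes and keeps the rest of the line
x_skip <- readLines( tf_small , skipNul = TRUE )
```

Comparing `x_default[ 2 ]` with `x_skip[ 2 ]` shows what the nul does to the affected line, and `n_nul` confirms whether a file contains nuls at all before handing it to `readLines`.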






Re: readLines without skipNul=TRUE causes crash

Jeff Newmiller
I'll pass. Just because some non-CRAN "archive" package has bugs or your disk storage is flaky does not mean that any of the dozens or hundreds of other compression tools (e.g. the built-in Windows "Send to compressed folder" pop-up menu) would also get it wrong, and we would know if one did fail because of the md5sum.
--
Sent from my phone. Please excuse my brevity.
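
The md5sum check mentioned above can be sketched as follows, assuming the two checksums reported earlier in this thread: the one ending in `072e` for the cleanly extracted DADOS_ENEM_2009.txt and the one ending in `8b19` for the extract that triggers the crash. The function name `check_extract` and its return labels are hypothetical, and whether the file later posted for download matches either checksum is not stated in the thread.

```r
# hypothetical helper: classify an extracted file by its md5 checksum,
# using the two values reported in this thread
check_extract <- function( path ) {

    intact_md5 <- "83e61c96092285b60d7bf6b0dbc7072e"
    corrupt_md5 <- "30beb57419486108e98d42ec7a2f8b19"

    # tools::md5sum() returns NA for files it cannot read
    md5 <- unname( tools::md5sum( path ) )

    if ( is.na( md5 ) ) return( "missing" )
    if ( md5 == intact_md5 ) return( "intact" )
    if ( md5 == corrupt_md5 ) return( "corrupt" )

    # any other checksum: a different file, or a different kind of damage
    "unknown"
}
```

Calling `check_extract( infile )` after each extraction tool would show whether that tool reproduced the damaged file or the intact one.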

On July 17, 2017 5:00:48 AM PDT, Anthony Damico <[hidden email]> wrote:

>hi, thanks again for taking the time.  since corrupted compression
>prompted
>the segfault for me in the first place, i've just posted the text file
>as-is.  it's a 2.4GB file so to be avoided on a metered internet
>connection.  i've updated the bugzilla report at
>https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
>relevant info.  these lines of code crash both windows R 3.4.1 and also
>linux R 3.3.3 for me.  thanks again
>
>
>    # consider changing `tempfile()` to a permanent location
>    # so you don't lose the large downloaded file after the crash
>    tf <- tempfile()
> download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt"
>, tf , mode = 'wb' )
>    sessionInfo()
>    x <- readLines( tf )
>
>
>
>
>On Sun, Jul 16, 2017 at 2:22 PM, Jeff Newmiller
><[hidden email]>
>wrote:
>
>> I am stuck. The archive package won't compile for me on Ubuntu, and
>the
>> CRANextra repo seems to be down so I cannot install packages on
>Windows
>> right now. Perhaps you can zip the corrupt text file and put it
>online
>> somewhere? Don't use the archive package to pack it since there seem
>to be
>> issues with that tool on your machine.
>>
>> I would discourage you from harassing the Brazilian government about
>their
>> RAR file because the RAR file seems fine (no NUL characters appear in
>the
>> text file) when extracted using the file-roller archive tool on
>Ubuntu.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 16, 2017 9:37:17 AM PDT, Anthony Damico <[hidden email]>
>> wrote:
>> >hi, yep, there are two problems -- but i think only the segfault is
>> >within
>> >the scope of a base R issue?  i need to look closer at the corrupted
>> >decompression and figure out whether i should talk to the brazilian
>> >government agency that creates that .rar file or open an issue with
>the
>> >archive package maintainer.  my goal in this thread is only to
>figure
>> >out
>> >how to replicate the goofy text file so the r team can turn it into
>an
>> >error instead of a segfault.
>> >
>> >the original example i sent stores the .txt file somewhere inside
>the
>> >tempdir(), but when i copy it over elsewhere on my machine, the
>> >md5sum()
>> >gives the same result.  thanks again for looking at this
>> >
>> >    > tools::md5sum(infile)
>> >
>> >C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_
>> folder/Microdados
>> >ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> >    "30beb57419486108e98d42ec7a2f8b19"
>> >
>> >
>> >    > tools::md5sum( "S:/temp/crash.txt" )
>> >                     S:/temp/crash.txt
>> >    "30beb57419486108e98d42ec7a2f8b19"
>> >
>> >
>> >
>> >
>> >On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller
>> ><[hidden email]>
>> >wrote:
>> >
>> >> So you are saying there are two problems... one that produces a
>> >corrupt
>> >> file from a valid compressed file, and one that segfaults when
>> >presented
>> >> with that corrupt file? Can you please confirm the file name and
>run
>> >md5sum
>> >> on it and share the result so we can tell when the file problem
>has
>> >been
>> >> reproduced?
>> >> --
>> >> Sent from my phone. Please excuse my brevity.
>> >>
>> >> On July 16, 2017 3:21:21 AM PDT, Anthony Damico
><[hidden email]>
>> >> wrote:
>> >> >hi, thank you for attempting this. it looks like your unix
>machine
>> >> >unzipped
>> >> >the txt file without corruption -- if you copied over the same
>txt
>> >file
>> >> >to
>> >> >windows 7, i don't think that would reproduce the problem?  i
>think
>> >it
>> >> >needs to be the corrupted text file where   R.utils::countLines(
>> >> >txtfile
>> >> >)   gives 809367.  i am able to reproduce on two distinct windows
>> >> >machines
>> >> >but no guarantee i'm not doing something dumb

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: readLines without skipNul=TRUE causes crash

R help mailing list-2
In reply to this post by R help mailing list-2
The original file had a lot of trailing null bytes so I tried making a
similar file with:

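tf <- tempfile(); file <- file(tf, "wb")
# about 2GB of bytes 32..127 with no newline, so the text is one huge line
for(i in 1:(2^15-1))writeBin(rep(as.raw(32:127), len=2^16), file)
# followed by about 2GB of trailing null bytes
for(i in 1:(2^15-1))writeBin(rep(as.raw(0L), len=2^16), file)
close(file)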
log2(file.size(tf))
#[1] 31.99996

Reading this with readLines() caused R-3.4.0 to segfault in Rf_con_pushback
with the same gdb traceback I saw when reading the original file.
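
For anyone who wants to check a file for null bytes before handing it to
readLines(), here is a minimal sketch at a much smaller scale. The helper
name count_nul_bytes and its chunk size are assumptions made up for this
illustration, not something from base R or from this thread; the sketch only
relies on file(), readBin() and readLines(). It does not fix the segfault, it
only helps decide when skipNul = TRUE is needed.

```r
# Hypothetical helper (not part of base R): count NUL bytes in a file by
# reading it in fixed-size binary chunks, so no huge string is ever built.
count_nul_bytes <- function(path, chunk_size = 2^20) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # end of file reached
    total <- total + sum(chunk == as.raw(0L))
  }
  total
}

# Small stand-in for the 4GB file: two text lines followed by trailing NULs.
tf <- tempfile()
con <- file(tf, "wb")
writeBin(charToRaw("line one\nline two\n"), con)
writeBin(raw(1000), con)                    # raw(1000) is 1000 zero bytes
close(con)

n_nul <- count_nul_bytes(tf)

# If NULs are present, ask readLines() to drop them instead of relying on
# the default skipNul = FALSE.
lines <- if (n_nul > 0) {
  readLines(tf, skipNul = TRUE, warn = FALSE)
} else {
  readLines(tf)
}
```

Reading in raw chunks keeps memory use bounded by chunk_size, which matters
when nearly the whole file sits on one line.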


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Jul 15, 2017 at 4:28 PM, William Dunlap <[hidden email]> wrote:

> I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
> debugging but gdb gives some information when I attach the debugger after
> the 'R..has stopped working' popup appears.  I don't know how reliable it
> is:
>
> (gdb) info threads
>   Id   Target Id         Frame
> * 4    Thread 11848.0x1500 0x00007ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   3    Thread 11848.0x2e90 0x00007ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   2    Thread 11848.0x3618 0x00007ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   1    Thread 11848.0x1808 0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> (gdb) thread 1
> [Switching to thread 1 (Thread 11848.0x1808)]
> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> (gdb) where
> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #1  0x000000006c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #2  0x000000006c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #3  0x000000006c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #4  0x000000006c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #5  0x000000006c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #6  0x000000006c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #7  0x000000006c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #8  0x000000006c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #9  0x000000006c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #10 0x000000006c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #11 0x000000000040171c in ?? ()
> #12 0x000000000040155a in ?? ()
> #13 0x00000000004013e8 in ?? ()
> #14 0x000000000040151b in ?? ()
> #15 0x00007ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
> #16 0x00007ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
> #17 0x0000000000000000 in ?? ()
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> (gdb)
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller <[hidden email]>
> wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>
>> ##########################
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ##             d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>> I am not able to reproduce this on a Linux platform:
>>>
>>> #######################3
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>> hi, i realized that the segfault happens on the text file in a new R
>>>> session.  so, creating the segfault-generating text file requires a
>>>> contributed package, but prompting the actual segfault does not --
>>>> pretty sure that means this is a base R bug?  submitted here:
>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>>>> hopefully i am not doing something remarkably stupid.  the text file
>>>> itself is 4GB so i cannot upload it to bugzilla, and from the
>>>> R_AllocStringBuffer error in the previous message, i think most or all
>>>> of it needs to be there to trigger the segfault.  thanks!
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <[hidden email]>
>>>> wrote:
>>>>
>>>> hi, thanks Dr. Murdoch
>>>>>
>>>>>
>>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>>> believe the segfault occurs because there's a single line with 4GB and
>>>>> also embedded nuls, but i am not sure how to artificially construct that?
>>>>>
>>>>>
>>>>> the lodown package can be removed from my example..  it is just for
>>>>> file download caching, so `lodown::cachaca` can be replaced with
>>>>> `download.file`.  my current example requires a huge download, so sort
>>>>> of painful to repeat but i'm pretty confident that's not the issue.
>>>>>
>>>>>
>>>>> the archive::archive_extract() function unzips a (probably corrupt)
>>>>> .RAR file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>>
>>>>>    > file.size(infile)
>>>>>     [1] 4078192743
>>>>>
>>>>>
>>>>> i am pretty sure that nearly all of that 4GB is contained on a single
>>>>> line in the file.  here's what happens when i create a file connection
>>>>> and scan through..
>>>>>
>>>>>    > w <- file( infile , 'r' )
>>>>>    >
>>>>>    > first_80936_lines <- readLines( w , n = 80936 )
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "1000023930632009"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "36F2924009PAULO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "AFONSO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "BA11"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "00000"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "00"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "2924009PAULO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "AFONSO"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "BA1111"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "467.20"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "346.10"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Read 1 item
>>>>>     [1] "414.40"
>>>>>    > scan( w , n = 1 , what = character() )
>>>>>     Error in scan(w, n = 1, what = character()) :
>>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>
>>>>>
>>>>>
>>>>> making a huge single-line file does not reproduce the problem, i think
>>>>> the embedded nuls have something to do with it--
>>>>>
>>>>>
>>>>>     # WARNING do not run with less than 64GB RAM
>>>>>     tf <- tempfile()
>>>>>     a <- rep( "a" , 1000000000 )
>>>>>     b <- paste( a , collapse = '' )
>>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>>     d <- readLines( tf )
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <[hidden email]> wrote:
>>>>>
>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>
>>>>>> hello, the last line of the code below causes a segfault for me on
>>>>>>> 3.4.1.  i think i should submit to https://bugs.r-project.org/  unless
>>>>>>> others have advice?  thanks
>>>>>>>
>>>>>>>
>>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>>> imagine you can do this by uploading the file you downloaded, or
>>>>>> enough of a subset of it to trigger the segfault.  If you can't do
>>>>>> that, then likely the bug is with one of those packages, not with R.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> install.packages( "devtools" )
>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>
>>>>>>>
>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>
>>>>>>> tf <- tempfile()
>>>>>>>
>>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>
>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>
>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>
>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>
>>>>>>> # works
>>>>>>> R.utils::countLines( infile )
>>>>>>>
>>>>>>> # works with warning
>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>
>>>>>>> # crash
>>>>>>> my_file <- readLines( infile )
>>>>>>>
>>>>>>>
>>>>>>> # run just before crash
>>>>>>> sessionInfo()
>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>
>>>>>>> # Matrix products: default
>>>>>>>
>>>>>>> # locale:
>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>> # [4] LC_NUMERIC=C
>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>
>>>>>>> # attached base packages:
>>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> # loaded via a namespace (and not attached):
>>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
