readLines without skipNul=TRUE causes crash

Re: readLines without skipNul=TRUE causes crash

ajdamico
awesome, thank you! looks like folks on bugzilla have also reproduced and
submitted a patch, so i am happy. thanks all

On Mon, Jul 17, 2017 at 11:36 AM, William Dunlap <[hidden email]> wrote:

> The original file had a lot of trailing null bytes so I tried making a
> similar file with:
>
> tf <- tempfile(); file <- file(tf, "wb")
> for(i in 1:(2^15-1))writeBin(rep(as.raw(32:127), len=2^16), file)
> for(i in 1:(2^15-1))writeBin(rep(as.raw(0L), len=2^16), file)
> close(file)
> log2(file.size(tf))
> #[1] 31.99996
>
> Reading this with readLines() caused R-3.4.0 to segfault in
> Rf_con_pushback with the same gdb traceback I saw when reading the original
> file.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Jul 15, 2017 at 4:28 PM, William Dunlap <[hidden email]> wrote:
>
>> I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
>> debugging but gdb gives some information when I attach the debugger after
>> the 'R..has stopped working' popup appears.  I don't know how reliable it
>> is:
>>
>> (gdb) info threads
>>   Id   Target Id         Frame
>> * 4    Thread 11848.0x1500 0x00007ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   3    Thread 11848.0x2e90 0x00007ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   2    Thread 11848.0x3618 0x00007ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   1    Thread 11848.0x1808 0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> (gdb) thread 1
>> [Switching to thread 1 (Thread 11848.0x1808)]
>> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> (gdb) where
>> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #1  0x000000006c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #2  0x000000006c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #3  0x000000006c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #4  0x000000006c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #5  0x000000006c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #6  0x000000006c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #7  0x000000006c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #8  0x000000006c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #9  0x000000006c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #10 0x000000006c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #11 0x000000000040171c in ?? ()
>> #12 0x000000000040155a in ?? ()
>> #13 0x00000000004013e8 in ?? ()
>> #14 0x000000000040151b in ?? ()
>> #15 0x00007ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
>> #16 0x00007ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>> #17 0x0000000000000000 in ?? ()
>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>> (gdb)
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller <[hidden email]> wrote:
>>
>>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>>
>>> ##########################
>>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>>> ##
>>> ## Matrix products: default
>>> ##
>>> ## locale:
>>> ## [1] LC_COLLATE=English_United States.1252
>>> ## [2] LC_CTYPE=English_United States.1252
>>> ## [3] LC_MONETARY=English_United States.1252
>>> ## [4] LC_NUMERIC=C
>>> ## [5] LC_TIME=English_United States.1252
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ##             d:/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>>
>>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>>
>>> I am not able to reproduce this on a Linux platform:
>>>>
>>>> #######################
>>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>>> sessionInfo()
>>>> ## R version 3.4.1 (2017-06-30)
>>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>>> ## Running under: Ubuntu 14.04.5 LTS
>>>> ##
>>>> ## Matrix products: default
>>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>>> ##
>>>> ## locale:
>>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>> ##
>>>> ## attached base packages:
>>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> ##
>>>> ## loaded via a namespace (and not attached):
>>>> ## [1] compiler_3.4.1
>>>> tools::md5sum( fn1 )
>>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>>> dat <- readLines( fn1 )
>>>> length( dat )
>>>> ## [1] 4148721
>>>>
>>>> No segfault occurs.
>>>>
>>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>>
>>>>> hi, i realized that the segfault happens on the text file in a new R
>>>>> session.  so, creating the segfault-generating text file requires a
>>>>> contributed package, but prompting the actual segfault does not --
>>>>> pretty sure that means this is a base R bug?  submitted here:
>>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>>>>> hopefully i am not doing something remarkably stupid.  the text file
>>>>> itself is 4GB so cannot upload it to bugzilla, and from the
>>>>> R_AllocStringBuffer error in the previous message, i think most or all
>>>>> of it needs to be there to trigger the segfault.  thanks!
>>>>>
>>>>>
>>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <[hidden email]> wrote:
>>>>>
>>>>> hi, thanks Dr. Murdoch
>>>>>>
>>>>>>
>>>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>>>> believe the segfault occurs because there's a single line with 4GB and
>>>>>> also embedded nuls, but i am not sure how to artificially construct that?
>>>>>>
>>>>>>
>>>>>> the lodown package can be removed from my example..  it is just for file
>>>>>> download caching, so `lodown::cachaca` can be replaced with
>>>>>> `download.file`.  my current example requires a huge download, so sort of
>>>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>>>
>>>>>>
>>>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>>>
>>>>>>    > file.size(infile)
>>>>>>     [1] 4078192743
>>>>>>
>>>>>>
>>>>>> i am pretty sure that nearly all of that 4GB is contained on a single
>>>>>> line in the file.  here's what happens when i create a file connection
>>>>>> and scan through..
>>>>>>
>>>>>>    > w <- file( infile , 'r' )
>>>>>>    >
>>>>>>    > first_80936_lines <- readLines( w , n = 80936 )
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "1000023930632009"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "36F2924009PAULO"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "AFONSO"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "BA11"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "00000"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "00"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "2924009PAULO"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "AFONSO"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "BA1111"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "467.20"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "346.10"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Read 1 item
>>>>>>     [1] "414.40"
>>>>>>    > scan( w , n = 1 , what = character() )
>>>>>>     Error in scan(w, n = 1, what = character()) :
>>>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>>
>>>>>>
>>>>>>
>>>>>> making a huge single-line file does not reproduce the problem, i think
>>>>>> the embedded nuls have something to do with it--
>>>>>>
>>>>>>
>>>>>>     # WARNING do not run with less than 64GB RAM
>>>>>>     tf <- tempfile()
>>>>>>     a <- rep( "a" , 1000000000 )
>>>>>>     b <- paste( a , collapse = '' )
>>>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>>>     d <- readLines( tf )
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <[hidden email]> wrote:
>>>>>>
>>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>>
>>>>>>>> hello, the last line of the code below causes a segfault for me on
>>>>>>>> 3.4.1.  i think i should submit to https://bugs.r-project.org/ unless
>>>>>>>> others have advice?  thanks
>>>>>>>>
>>>>>>>>
>>>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>>>> imagine you can do this by uploading the file you downloaded, or enough
>>>>>>> of a subset of it to trigger the segfault.  If you can't do that, then
>>>>>>> likely the bug is with one of those packages, not with R.
>>>>>>>
>>>>>>> Duncan Murdoch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> install.packages( "devtools" )
>>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>>
>>>>>>>>
>>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>>
>>>>>>>> tf <- tempfile()
>>>>>>>>
>>>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>>
>>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>>
>>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>>
>>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>>
>>>>>>>> # works
>>>>>>>> R.utils::countLines( infile )
>>>>>>>>
>>>>>>>> # works with warning
>>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>>
>>>>>>>> # crash
>>>>>>>> my_file <- readLines( infile )
>>>>>>>>
>>>>>>>>
>>>>>>>> # run just before crash
>>>>>>>> sessionInfo()
>>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>>
>>>>>>>> # Matrix products: default
>>>>>>>>
>>>>>>>> # locale:
>>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>>> # [4] LC_NUMERIC=C
>>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>>
>>>>>>>> # attached base packages:
>>>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>
>>>>>>>> # loaded via a namespace (and not attached):
>>>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>>
>>>>>>>>         [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------------
>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>
>>>>
>>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>
>>>
>>
>>
>

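A scaled-down sketch of the file construction Bill Dunlap describes above may help anyone who wants to see how readLines() treats trailing nul bytes without writing a 4GB file. The sizes (1000 bytes each), the byte range 32:126 and the variable names are assumptions chosen for illustration. A file this small is not expected to trigger the segfault, which in this thread was only reported for files of roughly 2^32 bytes.

```r
# Assumed, scaled-down variant of the construction quoted above:
# one line of printable bytes followed by a block of trailing nul bytes.
tf <- tempfile()
con <- file(tf, "wb")
writeBin(rep(as.raw(32:126), length.out = 1000), con)  # printable ASCII
writeBin(as.raw(10L), con)                             # newline ends line 1
writeBin(rep(as.raw(0L), length.out = 1000), con)      # trailing nul bytes
close(con)

size_bytes <- file.size(tf)   # 1000 + 1 + 1000 bytes

# Inspect the raw bytes directly instead of relying on readLines()
bytes <- readBin(tf, what = "raw", n = size_bytes)
n_nul <- sum(bytes == as.raw(0L))

# skipNul = TRUE drops nul bytes as they are read;
# warn = FALSE silences the incomplete-final-line warning
with_skip <- readLines(tf, skipNul = TRUE, warn = FALSE)
first_line_bytes <- nchar(with_skip[1], type = "bytes")
```

Scaling the two block sizes back up toward the values in the quoted message is what would be needed to approach the reported crash conditions, so that is best tried only in a disposable R session.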
Re: readLines without skipNul=TRUE causes crash

Martin Maechler
In reply to this post by ajdamico
>>>>> Anthony Damico <[hidden email]>
>>>>>     on Sun, 16 Jul 2017 06:40:38 -0400 writes:

    > hi, the text file that prompts the segfault is 4gb but only 80,937 lines
    >> file.info( "S:/temp/crash.txt")
    >                         size isdir mode               mtime               ctime               atime exe
    > S:/temp/crash.txt 4078192743 FALSE  666 2017-07-15 17:24:35 2017-07-15 17:19:47 2017-07-15 17:19:47  no


    > On Sun, Jul 16, 2017 at 6:34 AM, Duncan Murdoch <[hidden email]> wrote:

    >> On 16/07/2017 6:17 AM, Anthony Damico wrote:
    >>
    >>> thank you for taking the time to write this.  i set it running last
    >>> night and it's still going -- if it doesn't finish by tomorrow, i will
    >>> try to find a site to host the problem file and add that link to the bug
    >>> report so the archive package can be avoided at least.  i'm sorry for
    >>> the bother
    >>>
    >>>
    >> How big is that text file?  I wouldn't expect my script to take more than
    >> a few minutes even on a huge file.
    >>
    >> My script might have a bug...
    >>
    >> Duncan Murdoch
    >>
    >> On Sat, Jul 15, 2017 at 4:14 PM, Duncan Murdoch <[hidden email]> wrote:
    >>>
    >>> On 15/07/2017 11:33 AM, Anthony Damico wrote:
    >>>
    >>> hi, i realized that the segfault happens on the text file in a new R
    >>> session.  so, creating the segfault-generating text file requires a
    >>> contributed package, but prompting the actual segfault does not --
    >>> pretty sure that means this is a base R bug?  submitted here:
    >>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
    >>> hopefully i am not doing something remarkably stupid.  the text file
    >>> itself is 4GB so cannot upload it to bugzilla, and from the
    >>> R_AllocStringBuffer error in the previous message, i think most or all
    >>> of it needs to be there to trigger the segfault.  thanks!

In the meantime, communication has continued a bit at the bugzilla bug tracker
(https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311), and
as you can read there, the bug is fixed now, also thanks to an
initial patch proposal by Hannes Mühleisen.

Martin Maechler
ETH Zurich (and R Core)
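
For anyone who has to read a suspect file on an R version that predates the fix, the two ingredients discussed in this thread (embedded nul bytes and one very long line) can be checked without calling readLines() at all. The helper below, scan_file_bytes(), is hypothetical and not part of base R. It is a minimal sketch that reads raw chunks with readBin() and assumes lines end in a single newline byte (0x0A); the default chunk size is an arbitrary choice.

```r
# Hypothetical helper (not part of base R): scan a file in raw chunks and
# report how many nul bytes it holds and how long its longest line is,
# without ever holding the whole file in memory.
scan_file_bytes <- function(path, chunk_size = 2^20) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_nul   <- 0   # running count of 0x00 bytes
  longest <- 0   # longest completed line seen so far, in bytes
  current <- 0   # bytes seen so far on the line still being scanned
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # end of file
    n_nul <- n_nul + sum(chunk == as.raw(0L))
    nl <- which(chunk == as.raw(10L))       # newline positions in this chunk
    if (length(nl) == 0L) {
      # no newline: the current line simply keeps growing
      current <- current + length(chunk)
    } else {
      # the line carried over from earlier chunks ends at the first newline
      first_len <- current + nl[1L] - 1
      # lines that start and end inside this chunk
      inner <- if (length(nl) > 1L) max(diff(nl) - 1) else 0
      longest <- max(longest, first_len, inner)
      # bytes after the last newline start the next line
      current <- length(chunk) - nl[length(nl)]
    }
  }
  # a final line without a trailing newline still counts
  list(n_nul = n_nul, longest_line = max(longest, current))
}
```

If n_nul comes back non-zero, readLines(path, skipNul = TRUE) is the call that was reported to work earlier in this thread. A very large longest_line value suggests the file is worth inspecting before reading it line by line.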
