Very hard to reproduce bug (?) in R-devel

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Very hard to reproduce bug (?) in R-devel

Winston Chang
When running R CMD check on a package, we are encountering an error on
R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
reproduce, and almost any change to the code makes the error go away.
I believe that this is due to a bug in R-devel, which has been present
since at least commit 72128 (on 2017-02-06).

The test error occurs when R CMD check is run on Travis CI (on Ubuntu
12.04.05), and locally on R-devel (on Debian testing). If we make any
of the following changes, the error does not occur, and the test
passes:
- Run the test scripts from an interactive R session
- Run on a Mac instead of Linux
- Use R 3.3.3 or 3.2.5
- Add code right before the tests. For example, just running gc() or
format(1) right before the test causes it to not fail. However, adding
very simple code like 1+1 does not affect the test (it still fails).
- Run R CMD check with -d gdb or -d valgrind

Because the test is so sensitive to any changes, we've been unable to
track down the source of the problem. We can't be sure, but the error
message suggests that a promise is getting evaluated with the wrong
environment.

Here's the relevant output from the tests:
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
       scheduleTask(1000, function() {
           results <<- c(results, 3L)
       })
       scheduleTask(100, function() {
           results <<- c(results, 2L)
       })
       scheduleTask(10, function() {
           results <<- c(results, 1L)
       })
   }) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()
4: self$protect({
       task()
   })
5: private$refCount$release()
6: private$callback()
7: self$protect({
       task()
   })
8: force(expr)

You can see complete logs on Travis here:
  https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771

In the test, a re-entrant function is called (self$protect, at levels
4 and 7 on the call stack), and in the inner call (7), a variable,
`task`, does not have the correct value. It should be a function, but
it isn't; it's either NULL or some other non-function value.

That code is here:
  https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
Note that there's a `force(task)` a couple lines above, which is
commented out. If that is uncommented, the test passes, even though
calling force() there shouldn't make a difference. Also, calling
`format(1)` at that location also causes the test to pass.


I've written instructions to reproduce the problem in a Docker container here:
  https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f

If you want to try reproducing the error without using Docker, I
believe if you download pool_0.1.0.tar.gz file from the
gist.github.com URL above, and run R CMD check on it, that will do it.
Some upstream dependencies will likely need to be installed; they are
listed in the code in the gist.


We also used git bisect to try to find the first commit in R where the
test would fail. The instructions for doing this are also in the gist
linked to above. It found that the first commit where the test failed
was 72128:
  https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d

Taking a look at the changes in that commit, however, it seems very
unlikely that it caused the error. Instead, it probably just made a
change which allowed an already-existing bug to manifest.


We'd appreciate any help finding the cause of this bug!
-Winston

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
Small correction to my previous email: the error can also be
replicated by installing the package and then sourcing the tests. This
is slightly faster than running R CMD check. It is also how we ran the
tests with gdb. This will run the tests in a Docker container and
raise the same error. (To run outside of a Docker container, simply
leave out the first line).

  docker run --rm -ti pooltest /bin/bash
  RD CMD INSTALL pool_0.1.0.tar.gz  --install-tests
  cd /usr/local/lib/R/site-library/pool/tests
  RD -e 'source("testthat.R")'


At the suggestion of Tomas Kalibera, I ran the tests with
gctorture(TRUE). When I do this, I get three errors instead of just
one:

======================
# RD -e 'gctorture(TRUE); source("testthat.R")'

R Under development (unstable) (2017-03-23 r72389) -- "Unsuffered Consequences"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> gctorture(TRUE); source("testthat.R")
1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21) ---------
gc() showed 0 warnings


2. Failure: pool scheduler: schedules things in the right order
(@test-scheduling.R#24)
`results` not identical to 1:3.
2/3 mismatches (average diff: 1)
[1] 2 - 1 ==  1
[2] 1 - 2 == -1


3. Failure: pool scheduler: works with pool (@test-scheduling.R#31) ------------
pool$counters$free not identical to 1.
1/1 mismatches
[1] 0 - 1 == -1


testthat results
================================================================
OK: 199 SKIPPED: 0 FAILED: 3
1. Failure: pool: checks for leaks (named) (@test-leak-detection.R#21)
2. Failure: pool scheduler: schedules things in the right order
(@test-scheduling.R#24)
3. Failure: pool scheduler: works with pool (@test-scheduling.R#31)

Error: testthat unit tests failed
Execution halted

======================




On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <[hidden email]> wrote:

> When running R CMD check on a package, we are encountering an error on
> R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
> reproduce, and almost any change to the code makes the error go away.
> I believe that this is due to a bug in R-devel, which has been present
> since at least commit 72128 (on 2017-02-06).
>
> The test error occurs when R CMD check is run on Travis CI (on Ubuntu
> 12.04.05), and locally on R-devel (on Debian testing). If we make any
> of the following changes, the error does not occur, and the test
> passes:
> - Run the test scripts from an interactive R session
> - Run on a Mac instead of Linux
> - Use R 3.3.3 or 3.2.5
> - Add code right before the tests. For example, just running gc() or
> format(1) right before the test causes it to not fail. However, adding
> very simple code like 1+1 does not affect the test (it still fails).
> - Run R CMD check with -d gdb or -d valgrind
>
> Because the test is so sensitive to any changes, we've been unable to
> track down the source of the problem. We can't be sure, but the error
> message suggests that a promise is getting evaluated with the wrong
> environment.
>
> Here's the relevant output from the tests:
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
>        scheduleTask(1000, function() {
>            results <<- c(results, 3L)
>        })
>        scheduleTask(100, function() {
>            results <<- c(results, 2L)
>        })
>        scheduleTask(10, function() {
>            results <<- c(results, 1L)
>        })
>    }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
> 4: self$protect({
>        task()
>    })
> 5: private$refCount$release()
> 6: private$callback()
> 7: self$protect({
>        task()
>    })
> 8: force(expr)
>
> You can see complete logs on Travis here:
>   https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
>
> In the test, a re-entrant function is called (self$protect, at levels
> 4 and 7 on the call stack), and in the inner call (7), a variable,
> `task`, does not have the correct value. It should be a function, but
> it isn't; it's either NULL or some other non-function value.
>
> That code is here:
>   https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
> Note that there's a `force(task)` a couple lines above, which is
> commented out. If that is uncommented, the test passes, even though
> calling force() there shouldn't make a difference. Also, calling
> `format(1)` at that location also causes the test to pass.
>
>
> I've written instructions to reproduce the problem in a Docker container here:
>   https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
>
> If you want to try reproducing the error without using Docker, I
> believe if you download pool_0.1.0.tar.gz file from the
> gist.github.com URL above, and run R CMD check on it, that will do it.
> Some upstream dependencies will likely need to be installed; they are
> listed in the code in the gist.
>
>
> We also used git bisect to try to find the first commit in R where the
> test would fail. The instructions for doing this are also in the gist
> linked to above. It found that the first commit where the test failed
> was 72128:
>   https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82226644953d
>
> Taking a look at the changes in that commit, however, it seems very
> unlikely that it caused the error. Instead, it probably just made a
> change which allowed an already-existing bug to manifest.
>
>
> We'd appreciate any help finding the cause of this bug!
> -Winston

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
>
>
>>
I've done some more investigation into the problem, and it is very
difficult to pin down. What it looks like is happening is roughly like this:
- `p` is an environment and `p$e` is also an environment.
- There is a loop. In each iteration, it looks for one item in `p$e`, saves
it in a variable `x`, then removes that item from `p$e`. Then it invokes
`x()`. The loop runs again, until there are no more items in `p$e`.

The problem is that `ls(p$e)` sometimes returns the wrong values -- it
returns the values that it had in previous iterations of the loop. The
behavior is very touchy. Almost any change to the code will slightly change
the behavior; sometimes the `ls()` returns values from a different
iteration of the loop, and sometimes the problem doesn't happen at all.

I've put a  Dockerfile and instructions for reproducing the problem here:
  https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88

I think that I've gotten about as far with this as I can, though I'd be
happy to provide more information if anyone wants to take look at the
problem.

-Winston

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Martin Maechler
>>>>> Winston Chang <[hidden email]>
>>>>>     on Tue, 4 Apr 2017 15:29:40 -0500 writes:

    > I've done some more investigation into the problem, and it is very
    > difficult to pin down. What it looks like is happening is roughly like this:
    > - `p` is an environment and `p$e` is also an environment.
    > - There is a loop. In each iteration, it looks for one item in `p$e`, saves
    > it in a variable `x`, then removes that item from `p$e`. Then it invokes
    > `x()`. The loop runs again, until there are no more items in `p$e`.

    > The problem is that `ls(p$e)` sometimes returns the wrong values -- it
    > returns the values that it had in previous iterations of the loop. The
    > behavior is very touchy. Almost any change to the code will slightly change
    > the behavior; sometimes the `ls()` returns values from a different
    > iteration of the loop, and sometimes the problem doesn't happen at all.

    > I've put a  Dockerfile and instructions for reproducing the problem here:
    > https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88

    > I think that I've gotten about as far with this as I can, though I'd be
    > happy to provide more information if anyone wants to take look at the
    > problem.

Dear Winston,

While I agree this may very well be a bug in R(-devel), and hence
also R in 3.4.0 alpha and hence quite important to be dealt with,

your code still involves 3 non-trivial  packages (DBI, R6,
testthat) some of which have their own C code and notably load
a couple of other package's namespaces.
We've always made a point
      https://www.r-project.org/bugs.html
that bugs in R should be reproducible without extra
packages... and I think it would definitely help to pinpoint the
issue to be seen outside of your extra packages' world.

Or have you been aware of that and are just asking for help
finding a bug in one of the extra packages involved, a bug that might only be triggered by recent changes in R ?

OTOH, what you describe above  (p ; p$e ; p$e$x ...)
should be reproducible in pure "base" R code, right?

I'm sorry not to be of more help
Martin

    > -Winston

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
)

On Wed, Apr 5, 2017 at 2:59 AM, Martin Maechler
<[hidden email]> wrote:

>
> >>>>> Winston Chang <[hidden email]>
> >>>>>     on Tue, 4 Apr 2017 15:29:40 -0500 writes:
>
>     > I've done some more investigation into the problem, and it is very
>     > difficult to pin down. What it looks like is happening is roughly like this:
>     > - `p` is an environment and `p$e` is also an environment.
>     > - There is a loop. In each iteration, it looks for one item in `p$e`, saves
>     > it in a variable `x`, then removes that item from `p$e`. Then it invokes
>     > `x()`. The loop runs again, until there are no more items in `p$e`.
>
>     > The problem is that `ls(p$e)` sometimes returns the wrong values -- it
>     > returns the values that it had in previous iterations of the loop. The
>     > behavior is very touchy. Almost any change to the code will slightly change
>     > the behavior; sometimes the `ls()` returns values from a different
>     > iteration of the loop, and sometimes the problem doesn't happen at all.
>
>     > I've put a  Dockerfile and instructions for reproducing the problem here:
>     > https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88
>
>     > I think that I've gotten about as far with this as I can, though I'd be
>     > happy to provide more information if anyone wants to take look at the
>     > problem.
>
> Dear Winston,
>
> While I agree this may very well be a bug in R(-devel), and hence
> also R in 3.4.0 alpha and hence quite important to be dealt with,
>
> your code still involves 3 non-trivial  packages (DBI, R6,
> testthat) some of which have their own C code and notably load
> a couple of other package's namespaces.
> We've always made a point
>       https://www.r-project.org/bugs.html
> that bugs in R should be reproducible without extra
> packages... and I think it would definitely help to pinpoint the
> issue to be seen outside of your extra packages' world.
>
> Or have you been aware of that and are just asking for help
> finding a bug in one of the extra packages involved, a bug that might only be triggered by recent changes in R ?
>
> OTOH, what you describe above  (p ; p$e ; p$e$x ...)
> should be reproducible in pure "base" R code, right?
>
> I'm sorry not to be of more help
> Martin


Of the four packages that are loaded when running the tests (pool,
DBI, R6, testthat, magrittr, crayon), only testthat contains compiled
code, and it is pretty minimal. The only compiled code in testthat
that should be executed is a function that finds a label -- but that
happens only after an error occurs.

This is the sessionInfo():
R Under development (unstable) (2017-03-23 r72389)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pool_0.1.0     DBI_0.6-1      testthat_1.0.2

loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5   R6_2.2.0       crayon_1.3.2


I have spent days trying to make a minimal example, and that is the
best that I have been able to do so far. I would not involve all this
complexity if I could avoid it. The problem is that the behavior is
extremely sensitive to any changes. Seemingly-innocuous things like
removing other tests, or adding a cat() statement can make the error
disappear, or in some cases it makes ls() return values from a
different (previous) iteration of the loop.


In my testing, I have also seen a case where calls to `cat( ... ,
file=stderr())` result in output that has the wrong order. I don't
know if that's due to the cat() calls being executed in the wrong
order, or if it's simply being printed or buffered in the wrong order.

This is the code in question that cat()s to stderr:
https://github.com/rstudio/pool/blob/0724ad9/R/scheduler.R#L74-L90
      while (TRUE) {
        tasks <- sort(ls(private$scheduledTasks))
        if (length(tasks) == 0) break
        t <- tasks[[1]]

        s <- stderr()
        cat(tasks, "--1--\n", file = s)
        cat(ls(private$scheduledTasks), "--2--\n", file = s)
        cat(t, "--3--\n", file = s)

        task <- private$scheduledTasks[[t]]
        rm(list = t, envir = private$scheduledTasks)

        task()
      }

Without going into too much detail, it should print lines of text that
end with --1--, --2--, --3--, and repeat. Here's what it prints
instead when running the tests:

20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --1--
20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --1--
20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --2--
20170405-182549.466875-18 --3--
20170405-182559.456628-17 20170405-182729.456318-16 --1--
20170405-182559.456628-17 20170405-182729.456318-16 --2--
20170405-182559.456628-17 --3--
20170405-182729.456318-16 --1--
20170405-182729.456318-16 --2--
20170405-182729.456318-16 --3--
 --2--
20170405-182549.466875-18 --3--
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
       scheduleTask(1e+05, function() {
           results <<- c(results, 3L)
       })
       scheduleTask(10000, function() {
           results <<- c(results, 2L)
       })
       scheduleTask(10, function() {
           results <<- c(results, 1L)
       })
   }) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()

It's almost as though, in the middle of the first iteration of the
while loop, R jumps to the next iteration of the loop, runs the loop a
couple of times to completion, and then returns to the first iteration
of the loop at the place that it left.

This can be reproduced by following the instructions in this gist:
  https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88

Almost any change to the code will make the error disappear, or change
to a different one.


With regard to reproducing it in "base" R: I made a simple example
using just R (no other packages) that does something similar to what
happens in the tests, but even when I run it for 100,000 iterations,
the error doesn't occur.

I think there's a good chance that this is due to a bug in R. I have
been trying to track down the cause of the problem but haven't been
able find it.

-Winston

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Robert McGehee-2
Winston,
I had a similar experience to you tracking down an insanely difficult bug in my R code that "disappeared" whenever slight changes were made to the script (e.g. like adding cat() statements). In my case, it coincided with my over-eager compilation of R and its library stack, as I was also experimenting with a cutting edge version of the gcc compiler as well as what I thought were innocuous performance enhancing CFLAGS like -O3/-Ofast -march=native. After downgrading gcc and recompiling everything (R and BLAS) without the extra flags, the problem went away. Not saying this is your problem, just sharing my similar experience.

<TANGENT> And for anyone interested, I did extensive benchmarking on the effects of the added CFLAGS and cutting edge gcc compilers, and I never saw any significant performance enhancement, and frequently saw a big performance penalty with extra flags, particularly as matrix algebra problems sometimes slowed down enormously when using a custom BLAS (ATLAS) compiled with anything fancy. Though nowadays, the out-of-the-box R BLAS seems much faster than it used to be, so I don't even bother fiddling with a custom BLAS. </TANGENT>

--Robert
 

-----Original Message-----
From: R-devel [mailto:[hidden email]] On Behalf Of Winston Chang
Sent: Wednesday, April 05, 2017 2:41 PM
To: Martin Maechler <[hidden email]>
Cc: R Development <[hidden email]>
Subject: Re: [Rd] Very hard to reproduce bug (?) in R-devel

)

On Wed, Apr 5, 2017 at 2:59 AM, Martin Maechler
<[hidden email]> wrote:

>
> >>>>> Winston Chang <[hidden email]>
> >>>>>     on Tue, 4 Apr 2017 15:29:40 -0500 writes:
>
>     > I've done some more investigation into the problem, and it is very
>     > difficult to pin down. What it looks like is happening is roughly like this:
>     > - `p` is an environment and `p$e` is also an environment.
>     > - There is a loop. In each iteration, it looks for one item in `p$e`, saves
>     > it in a variable `x`, then removes that item from `p$e`. Then it invokes
>     > `x()`. The loop runs again, until there are no more items in `p$e`.
>
>     > The problem is that `ls(p$e)` sometimes returns the wrong values -- it
>     > returns the values that it had in previous iterations of the loop. The
>     > behavior is very touchy. Almost any change to the code will slightly change
>     > the behavior; sometimes the `ls()` returns values from a different
>     > iteration of the loop, and sometimes the problem doesn't happen at all.
>
>     > I've put a  Dockerfile and instructions for reproducing the problem here:
>     > https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88
>
>     > I think that I've gotten about as far with this as I can, though I'd be
>     > happy to provide more information if anyone wants to take look at the
>     > problem.
>
> Dear Winston,
>
> While I agree this may very well be a bug in R(-devel), and hence
> also R in 3.4.0 alpha and hence quite important to be dealt with,
>
> your code still involves 3 non-trivial  packages (DBI, R6,
> testthat) some of which have their own C code and notably load
> a couple of other package's namespaces.
> We've always made a point
>       https://www.r-project.org/bugs.html
> that bugs in R should be reproducible without extra
> packages... and I think it would definitely help to pinpoint the
> issue to be seen outside of your extra packages' world.
>
> Or have you been aware of that and are just asking for help
> finding a bug in one of the extra packages involved, a bug that might only be triggered by recent changes in R ?
>
> OTOH, what you describe above  (p ; p$e ; p$e$x ...)
> should be reproducible in pure "base" R code, right?
>
> I'm sorry not to be of more help
> Martin


Of the four packages that are loaded when running the tests (pool,
DBI, R6, testthat, magrittr, crayon), only testthat contains compiled
code, and it is pretty minimal. The only compiled code in testthat
that should be executed is a function that finds a label -- but that
happens only after an error occurs.

This is the sessionInfo():
R Under development (unstable) (2017-03-23 r72389)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pool_0.1.0     DBI_0.6-1      testthat_1.0.2

loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5   R6_2.2.0       crayon_1.3.2


I have spent days trying to make a minimal example, and that is the
best that I have been able to do so far. I would not involve all this
complexity if I could avoid it. The problem is that the behavior is
extremely sensitive to any changes. Seemingly-innocuous things like
removing other tests, or adding a cat() statement can make the error
disappear, or in some cases it makes ls() return values from a
different (previous) iteration of the loop.


In my testing, I have also seen a case where calls to `cat( ... ,
file=stderr())` result in output that has the wrong order. I don't
know if that's due to the cat() calls being executed in the wrong
order, or if it's simply being printed or buffered in the wrong order.

This is the code in question that cat()s to stderr:
https://github.com/rstudio/pool/blob/0724ad9/R/scheduler.R#L74-L90
      while (TRUE) {
        tasks <- sort(ls(private$scheduledTasks))
        if (length(tasks) == 0) break
        t <- tasks[[1]]

        s <- stderr()
        cat(tasks, "--1--\n", file = s)
        cat(ls(private$scheduledTasks), "--2--\n", file = s)
        cat(t, "--3--\n", file = s)

        task <- private$scheduledTasks[[t]]
        rm(list = t, envir = private$scheduledTasks)

        task()
      }

Without going into too much detail, it should print lines of text that
end with --1--, --2--, --3--, and repeat. Here's what it prints
instead when running the tests:

20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --1--
20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --1--
20170405-182549.466875-18 20170405-182559.456628-17
20170405-182729.456318-16 --2--
20170405-182549.466875-18 --3--
20170405-182559.456628-17 20170405-182729.456318-16 --1--
20170405-182559.456628-17 20170405-182729.456318-16 --2--
20170405-182559.456628-17 --3--
20170405-182729.456318-16 --1--
20170405-182729.456318-16 --2--
20170405-182729.456318-16 --3--
 --2--
20170405-182549.466875-18 --3--
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
       scheduleTask(1e+05, function() {
           results <<- c(results, 3L)
       })
       scheduleTask(10000, function() {
           results <<- c(results, 2L)
       })
       scheduleTask(10, function() {
           results <<- c(results, 1L)
       })
   }) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()

It's almost as though, in the middle of the first iteration of the
while loop, R jumps to the next iteration of the loop, runs the loop a
couple of times to completion, and then returns to the first iteration
of the loop at the place that it left.

This can be reproduced by following the instructions in this gist:
  https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88

Almost any change to the code will make the error disappear, or change
to a different one.


With regard to reproducing it in "base" R: I made a simple example
using just R (no other packages) that does something similar to what
happens in the tests, but even when I run it for 100,000 iterations,
the error doesn't occur.

I think there's a good chance that this is due to a bug in R. I have
been trying to track down the cause of the problem but haven't been
able find it.

-Winston

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
On Wed, Apr 5, 2017 at 2:24 PM, Robert McGehee <[hidden email]>
wrote:

> Winston,
> I had a similar experience to you tracking down an insanely difficult bug
> in my R code that "disappeared" whenever slight changes were made to the
> script (e.g. like adding cat() statements). In my case, it coincided with
> my over-eager compilation of R and its library stack, as I was also
> experimenting with a cutting edge version of the gcc compiler as well as
> what I thought were innocuous performance enhancing CFLAGS like -O3/-Ofast
> -march=native. After downgrading gcc and recompiling everything (R and
> BLAS) without the extra flags, the problem went away. Not saying this is
> your problem, just sharing my similar experience.
>
>
Thanks Robert. I'm glad that I'm not the only one who's encountered an
issue like this. "Insanely difficult" is an apt description. :)

I've been using the rocker/r-devel for testing. It compiles R with the
following CFLAGS:
  -g -O2 -fdebug-prefix-map=/build/r-base-3.3.3=. -fstack-protector-strong
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
It looks like it gets those settings from running R CMD config CFLAGS with
the already-installed version of R (3.3.3) which comes from a .deb package.
  https://github.com/rocker-org/rocker/blob/master/r-devel/Dockerfile#L76


I've also compiled R (again, in Docker) and tested with that, and gotten
the same results. It basically uses just `./configure
--without-recommended-packages`
and then `make`.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
In reply to this post by Robert McGehee-2
I just tried recompiling R with no -O flag, and I still get the same error.
Here are the CFLAGS (the RD program runs R-devel instead of R 3.3.3):

# RD CMD config CFLAGS
-g -fdebug-prefix-map=/build/r-base-3.3.3=. -fstack-protector-strong
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g

-Winston


On Wed, Apr 5, 2017 at 2:24 PM, Robert McGehee <[hidden email]>
wrote:

> Winston,
> I had a similar experience to you tracking down an insanely difficult bug
> in my R code that "disappeared" whenever slight changes were made to the
> script (e.g. like adding cat() statements). In my case, it coincided with
> my over-eager compilation of R and its library stack, as I was also
> experimenting with a cutting edge version of the gcc compiler as well as
> what I thought were innocuous performance enhancing CFLAGS like -O3/-Ofast
> -march=native. After downgrading gcc and recompiling everything (R and
> BLAS) without the extra flags, the problem went away. Not saying this is
> your problem, just sharing my similar experience.
>
> <TANGENT> And for anyone interested, I did extensive benchmarking on the
> effects of the added CFLAGS and cutting edge gcc compilers, and I never saw
> any significant performance enhancement, and frequently saw a big
> performance penalty with extra flags, particularly as matrix algebra
> problems sometimes slowed down enormously when using a custom BLAS (ATLAS)
> compiled with anything fancy. Though nowadays, the out-of-the-box R BLAS
> seems much faster than it used to be, so I don't even bother fiddling with
> a custom BLAS. </TANGENT>
>
> --Robert
>
>
> -----Original Message-----
> From: R-devel [mailto:[hidden email]] On Behalf Of Winston
> Chang
> Sent: Wednesday, April 05, 2017 2:41 PM
> To: Martin Maechler <[hidden email]>
> Cc: R Development <[hidden email]>
> Subject: Re: [Rd] Very hard to reproduce bug (?) in R-devel
>
> )
>
> On Wed, Apr 5, 2017 at 2:59 AM, Martin Maechler
> <[hidden email]> wrote:
> >
> > >>>>> Winston Chang <[hidden email]>
> > >>>>>     on Tue, 4 Apr 2017 15:29:40 -0500 writes:
> >
> >     > I've done some more investigation into the problem, and it is very
> >     > difficult to pin down. What it looks like is happening is roughly
> like this:
> >     > - `p` is an environment and `p$e` is also an environment.
> >     > - There is a loop. In each iteration, it looks for one item in
> `p$e`, saves
> >     > it in a variable `x`, then removes that item from `p$e`. Then it
> invokes
> >     > `x()`. The loop runs again, until there are no more items in `p$e`.
> >
> >     > The problem is that `ls(p$e)` sometimes returns the wrong values
> -- it
> >     > returns the values that it had in previous iterations of the loop.
> The
> >     > behavior is very touchy. Almost any change to the code will
> slightly change
> >     > the behavior; sometimes the `ls()` returns values from a different
> >     > iteration of the loop, and sometimes the problem doesn't happen at
> all.
> >
> >     > I've put a  Dockerfile and instructions for reproducing the
> problem here:
> >     > https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88
> >
> >     > I think that I've gotten about as far with this as I can, though
> I'd be
> >     > happy to provide more information if anyone wants to take look at
> the
> >     > problem.
> >
> > Dear Winston,
> >
> > While I agree this may very well be a bug in R(-devel), and hence
> > also R in 3.4.0 alpha and hence quite important to be dealt with,
> >
> > your code still involves 3 non-trivial  packages (DBI, R6,
> > testthat) some of which have their own C code and notably load
> > a couple of other package's namespaces.
> > We've always made a point
> >       https://www.r-project.org/bugs.html
> > that bugs in R should be reproducible without extra
> > packages... and I think it would definitely help to pinpoint the
> > issue to be seen outside of your extra packages' world.
> >
> > Or have you been aware of that and are just asking for help
> > finding a bug in one of the extra packages involved, a bug that might
> only be triggered by recent changes in R ?
> >
> > OTOH, what you describe above  (p ; p$e ; p$e$x ...)
> > should be reproducible in pure "base" R code, right?
> >
> > I'm sorry not to be of more help
> > Martin
>
>
> Of the four packages that are loaded when running the tests (pool,
> DBI, R6, testthat, magrittr, crayon), only testthat contains compiled
> code, and it is pretty minimal. The only compiled code in testthat
> that should be executed is a function that finds a label -- but that
> happens only after an error occurs.
>
> This is the sessionInfo():
> R Under development (unstable) (2017-03-23 r72389)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Debian GNU/Linux 9 (stretch)
>
> Matrix products: default
> BLAS: /usr/local/lib/R/lib/libRblas.so
> LAPACK: /usr/local/lib/R/lib/libRlapack.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] pool_0.1.0     DBI_0.6-1      testthat_1.0.2
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0 magrittr_1.5   R6_2.2.0       crayon_1.3.2
>
>
> I have spent days trying to make a minimal example, and that is the
> best that I have been able to do so far. I would not involve all this
> complexity if I could avoid it. The problem is that the behavior is
> extremely sensitive to any changes. Seemingly-innocuous things like
> removing other tests, or adding a cat() statement can make the error
> disappear, or in some cases it makes ls() return values from a
> different (previous) iteration of the loop.
>
>
> In my testing, I have also seen a case where calls to `cat( ... ,
> file=stderr())` result in output that has the wrong order. I don't
> know if that's due to the cat() calls being executed in the wrong
> order, or if it's simply being printed or buffered in the wrong order.
>
> This is the code in question that cat()s to stderr:
> https://github.com/rstudio/pool/blob/0724ad9/R/scheduler.R#L74-L90
>       while (TRUE) {
>         tasks <- sort(ls(private$scheduledTasks))
>         if (length(tasks) == 0) break
>         t <- tasks[[1]]
>
>         s <- stderr()
>         cat(tasks, "--1--\n", file = s)
>         cat(ls(private$scheduledTasks), "--2--\n", file = s)
>         cat(t, "--3--\n", file = s)
>
>         task <- private$scheduledTasks[[t]]
>         rm(list = t, envir = private$scheduledTasks)
>
>         task()
>       }
>
> Without going into too much detail, it should print lines of text that
> end with --1--, --2--, --3--, and repeat. Here's what it prints
> instead when running the tests:
>
> 20170405-182549.466875-18 20170405-182559.456628-17
> 20170405-182729.456318-16 --1--
> 20170405-182549.466875-18 20170405-182559.456628-17
> 20170405-182729.456318-16 --1--
> 20170405-182549.466875-18 20170405-182559.456628-17
> 20170405-182729.456318-16 --2--
> 20170405-182549.466875-18 --3--
> 20170405-182559.456628-17 20170405-182729.456318-16 --1--
> 20170405-182559.456628-17 20170405-182729.456318-16 --2--
> 20170405-182559.456628-17 --3--
> 20170405-182729.456318-16 --1--
> 20170405-182729.456318-16 --2--
> 20170405-182729.456318-16 --3--
>  --2--
> 20170405-182549.466875-18 --3--
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
>        scheduleTask(1e+05, function() {
>            results <<- c(results, 3L)
>        })
>        scheduleTask(10000, function() {
>            results <<- c(results, 2L)
>        })
>        scheduleTask(10, function() {
>            results <<- c(results, 1L)
>        })
>    }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
>
> It's almost as though, in the middle of the first iteration of the
> while loop, R jumps to the next iteration of the loop, runs the loop a
> couple of times to completion, and then returns to the first iteration
> of the loop at the place that it left.
>
> This can be reproduced by following the instructions in this gist:
>   https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88
>
> Almost any change to the code will make the error disappear, or change
> to a different one.
>
>
> With regard to reproducing it in "base" R: I made a simple example
> using just R (no other packages) that does something similar to what
> happens in the tests, but even when I run it for 100,000 iterations,
> the error doesn't occur.
>
> I think there's a good chance that this is due to a bug in R. I have
> been trying to track down the cause of the problem but haven't been
> able find it.
>
> -Winston
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Dirk Eddelbuettel
In reply to this post by Winston Chang

On 5 April 2017 at 15:46, Winston Chang wrote:
| On Wed, Apr 5, 2017 at 2:24 PM, Robert McGehee <[hidden email]>
| wrote:
|
| > Winston,
| > I had a similar experience to you tracking down an insanely difficult bug
| > in my R code that "disappeared" whenever slight changes were made to the
| > script (e.g. like adding cat() statements). In my case, it coincided with
| > my over-eager compilation of R and its library stack, as I was also
| > experimenting with a cutting edge version of the gcc compiler as well as
| > what I thought were innocuous performance enhancing CFLAGS like -O3/-Ofast
| > -march=native. After downgrading gcc and recompiling everything (R and
| > BLAS) without the extra flags, the problem went away. Not saying this is
| > your problem, just sharing my similar experience.
| >
| >
| Thanks Robert. I'm glad that I'm not the only one who's encountered an
| issue like this. "Insanely difficult" is an apt description. :)
|
| I've been using the rocker/r-devel for testing. It compiles R with the
| following CFLAGS:
|   -g -O2 -fdebug-prefix-map=/build/r-base-3.3.3=. -fstack-protector-strong
| -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
| It looks like it gets those settings from running R CMD config CFLAGS with
| the already-installed version of R (3.3.3) which comes from a .deb package.
|   https://github.com/rocker-org/rocker/blob/master/r-devel/Dockerfile#L76

That's a Debian default a Policy-conforming offical package must use.

For Docker you get these from the (prebuild) .deb package, you get it when
you do r-devel by hand as they come back in via R CMD CONFIG:

  https://github.com/rocker-org/rocker/blob/master/r-devel/Dockerfile#L76-L77

You could undo those two lines / set CFLAGS and CXXFLAGS by hand.
 
| I've also compiled R (again, in Docker) and tested with that, and gotten
| the same results. It basically uses just `./configure
| --without-recommended-packages`
| and then `make`.

Then these options should not come in.  But are you saying that the Heisenbug
behaviour happens irrespective of the compilation flags?

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
Hi Dirk, I sent another message just before yours, so you may not have seen
it:

===========

I just tried recompiling R with no -O flag, and I still get the same error.
Here are the CFLAGS (the RD program runs R-devel instead of R 3.3.3):

# RD CMD config CFLAGS
-g -fdebug-prefix-map=/build/r-base-3.3.3=. -fstack-protector-strong
-Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g

===========

Additionally, I tried again with the rocker/r-devel-san image, and got the
same error. There was no other output from the sanitizers. In this
container:

# RD CMD config CFLAGS
-pipe -Wall -pedantic -O2 -mtune=native -fsanitize=address

The output looks like this (it should be in order --1--, --2--, --3--,
repeat):
20170405-213410.095858-18 20170405-213420.085451-17
20170405-213550.084947-16 --1--
20170405-213410.095858-18 20170405-213420.085451-17
20170405-213550.084947-16 --1--
20170405-213410.095858-18 20170405-213420.085451-17
20170405-213550.084947-16 --2--
20170405-213410.095858-18 --3--
20170405-213420.085451-17 20170405-213550.084947-16 --1--
20170405-213420.085451-17 20170405-213550.084947-16 --2--
20170405-213420.085451-17 --3--
20170405-213550.084947-16 --1--
20170405-213550.084947-16 --2--
20170405-213550.084947-16 --3--
 --2--
20170405-213410.095858-18 --3--
1. Error: pool scheduler: schedules things in the right order
(@test-scheduling.R#13)
could not find function "task"
1: naiveScheduler$protect({
       scheduleTask(1e+05, function() {
           results <<- c(results, 3L)
       })
       scheduleTask(10000, function() {
           results <<- c(results, 2L)
       })
       scheduleTask(10, function() {
           results <<- c(results, 1L)
       })
   }) at testthat/test-scheduling.R:13
2: private$refCount$release() at testthat/test-scheduling.R:13
3: private$callback()

-Winston


On Wed, Apr 5, 2017 at 4:32 PM, Dirk Eddelbuettel <[hidden email]> wrote:

>
> On 5 April 2017 at 15:46, Winston Chang wrote:
> | On Wed, Apr 5, 2017 at 2:24 PM, Robert McGehee <
> [hidden email]>
> | wrote:
> |
> | > Winston,
> | > I had a similar experience to you tracking down an insanely difficult
> bug
> | > in my R code that "disappeared" whenever slight changes were made to
> the
> | > script (e.g. like adding cat() statements). In my case, it coincided
> with
> | > my over-eager compilation of R and its library stack, as I was also
> | > experimenting with a cutting edge version of the gcc compiler as well
> as
> | > what I thought were innocuous performance enhancing CFLAGS like
> -O3/-Ofast
> | > -march=native. After downgrading gcc and recompiling everything (R and
> | > BLAS) without the extra flags, the problem went away. Not saying this
> is
> | > your problem, just sharing my similar experience.
> | >
> | >
> | Thanks Robert. I'm glad that I'm not the only one who's encountered an
> | issue like this. "Insanely difficult" is an apt description. :)
> |
> | I've been using the rocker/r-devel for testing. It compiles R with the
> | following CFLAGS:
> |   -g -O2 -fdebug-prefix-map=/build/r-base-3.3.3=.
> -fstack-protector-strong
> | -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
> | It looks like it gets those settings from running R CMD config CFLAGS
> with
> | the already-installed version of R (3.3.3) which comes from a .deb
> package.
> |   https://github.com/rocker-org/rocker/blob/master/r-devel/
> Dockerfile#L76
>
> That's a Debian default a Policy-conforming offical package must use.
>
> For Docker you get these from the (prebuild) .deb package, you get it when
> you do r-devel by hand as they come back in via R CMD CONFIG:
>
>   https://github.com/rocker-org/rocker/blob/master/r-devel/
> Dockerfile#L76-L77
>
> You could undo those two lines / set CFLAGS and CXXFLAGS by hand.
>
> | I've also compiled R (again, in Docker) and tested with that, and gotten
> | the same results. It basically uses just `./configure
> | --without-recommended-packages`
> | and then `make`.
>
> Then these options should not come in.  But are you saying that the
> Heisenbug
> behaviour happens irrespective of the compilation flags?
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Peter Dalgaard-2
In reply to this post by Winston Chang

> On 05 Apr 2017, at 20:40 , Winston Chang <[hidden email]> wrote:
>
> I think there's a good chance that this is due to a bug in R. I have
> been trying to track down the cause of the problem but haven't been
> able find it.
>
> -Winston

Apologies in advance if this is just stating the obvious, but let me try and put some general ideas  on the table.

- is anything non-deterministic involved? (Doesn't sound so, but...)
- could it be something with the bytecompiler?
- can you get something (_anything_) to trigger the bug (in any variant) when running R under gdb? I'm thinking gctorture() etc.
- it is odd that you cannot immediately get the same behaviour with R -d gdb or valgrind. Are you sure you are actually running the same script in the same way?
- if you can get a hold of something inside gdb, then there should be some potential for backtracking using hardware watchpoints and such. As in: This memory location doesn't contain the value I expected; what changed it?

-pd


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Uwe Ligges-3


On 05.04.2017 23:54, peter dalgaard wrote:

>
>> On 05 Apr 2017, at 20:40 , Winston Chang <[hidden email]> wrote:
>>
>> I think there's a good chance that this is due to a bug in R. I have
>> been trying to track down the cause of the problem but haven't been
>> able find it.
>>
>> -Winston
>
> Apologies in advance if this is just stating the obvious, but let me try and put some general ideas  on the table.
>
> - is anything non-deterministic involved? (Doesn't sound so, but...)
> - could it be something with the bytecompiler?

Also my suspicion, can you try without having JIT enabled?

Best,
Uwe Ligges



> - can you get something (_anything_) to trigger the bug (in any variant) when running R under gdb? I'm thinking gctorture() etc.
> - it is odd that you cannot immediately get the same behaviour with R -d gdb or valgrind. Are you sure you are actually running the same script in the same way?
> - if you can get a hold of something inside gdb, then there should be some potential for backtracking using hardware watchpoints and such. As in: This memory location doesn't contain the value I expected; what changed it?
>
> -pd
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
In reply to this post by Peter Dalgaard-2
>
> Apologies in advance if this is just stating the obvious, but let me try
> and put some general ideas  on the table.


These are great ideas, thanks.



> - is anything non-deterministic involved? (Doesn't sound so, but...)
>

There was an environment where items were added, and the names of the items
had timestamps. However, I just modified that code to use deterministic
names and the error still happened.


- could it be something with the bytecompiler?
>

I've tried two things. The first was to install the pool package with
--no-byte-compile. In this case the error still happens.

The second was to compile R with `./configure
--enable-byte-compiled-packages=no`.
When I do this, the error does NOT happen. I've tried varying the pool code
in a few different ways to try to provoke the error, but I have not been
able to get it to happen. So it is possible that the compiled base R
packages play some part here.



> - can you get something (_anything_) to trigger the bug (in any variant)
> when running R under gdb? I'm thinking gctorture() etc.
>

Some variations of the code will error without gdb, but will not error with
gdb. I twiddled with the code a bit and now the current version of the code
(f97cfdf) will error under gdb.

I've also run it with gctorture(T) and have not seen this error with that
enabled, but I haven't tested it extensively in this mode. (In a previous
email I mentioned that with gctorture on, I got three different errors in
the tests. I later found that these errors were due to tests having
incorrect assumptions. For example, one test called gc() and expected a
warning, but it incorrectly assumed that a GC event would not have occurred
slightly earlier.)



> - it is odd that you cannot immediately get the same behaviour with R -d
> gdb or valgrind. Are you sure you are actually running the same script in
> the same way?
>

Some versions of the code, but not all, will give the same error under gdb
and valgrind. See above.



> - if you can get a hold of something inside gdb, then there should be some
> potential for backtracking using hardware watchpoints and such. As in: This
> memory location doesn't contain the value I expected; what changed it?
>

I probably don't know enough about R internals or gdb to be useful here.
But if someone wants to try it out, reproducing it as simple as copying and
pasting the first two blocks of code from the README here (assuming you
have Docker installed):
  https://gist.github.com/wch/2596a1c9f1bcdee91bb210c782141c88
It will build a Docker image with the appropriate software installed, and
then run the tests. The README also shows how to run it with gdb.

-Winston

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
In reply to this post by Uwe Ligges-3
>
> Also my suspicion, can you try without having JIT enabled?
>

The results for different JIT levels. I ran compiler::enableJIT() before
sourcing the test file:
3: error
2: error
1: OK
0: OK

-Winston

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Very hard to reproduce bug (?) in R-devel

Winston Chang
In reply to this post by Winston Chang
With Luke Tierney's assistance, we found that the root cause of the problem
was not R, but a race condition in the pool package that occurred when a
finalizer ran.

Below is a very contrived and simplified example to illustrate:

gc()
n <- 1

increment_n <- function() {
  cat(n, "-", sep = "")
  # Do some stuff that allocates memory. Sometimes this triggers a GC.
  lapply(rnorm(2e4), mean)
  cat(n, "\n")
  n <<- n + 1
}

# Create an object, and add a finalizer that simply calls increment_n
e <- new.env()
reg.finalizer(e, function(e) { increment_n() })
rm(e)

for (i in 1:5) {
  increment_n()
}

On the surface, it looks like this should print out "1-1", "2-2", and so
on, on separate lines. But when I run it (on my Mac), it outputs this:
1-1
2-2
3-3-3
4
5-5
6-6


On the positive side, there are some lessons that can be taken from this.
If you encounter a seemingly random and hard-to-diagnose problem, it could
be due to a finalizer. Most of us think about our R code as straight line
code that runs on a single thread. We don't even consider things like race
conditions, since they are problems usually encountered in multithreaded
environments. But finalizers violate this mental model. Because they can
unexpectedly interrupt your code and execute other code, you can end up
with some of the same problems that you would face in a multithreaded
environment.

Thanks to everyone who spent time helping to diagnose the problem!
-Winston


On Mon, Apr 3, 2017 at 10:34 AM, Winston Chang <[hidden email]>
wrote:

> When running R CMD check on a package, we are encountering an error on
> R-devel (as of 72457) on Linux. Unfortunately, it is very hard to
> reproduce, and almost any change to the code makes the error go away.
> I believe that this is due to a bug in R-devel, which has been present
> since at least commit 72128 (on 2017-02-06).
>
> The test error occurs when R CMD check is run on Travis CI (on Ubuntu
> 12.04.05), and locally on R-devel (on Debian testing). If we make any
> of the following changes, the error does not occur, and the test
> passes:
> - Run the test scripts from an interactive R session
> - Run on a Mac instead of Linux
> - Use R 3.3.3 or 3.2.5
> - Add code right before the tests. For example, just running gc() or
> format(1) right before the test causes it to not fail. However, adding
> very simple code like 1+1 does not affect the test (it still fails).
> - Run R CMD check with -d gdb or -d valgrind
>
> Because the test is so sensitive to any changes, we've been unable to
> track down the source of the problem. We can't be sure, but the error
> message suggests that a promise is getting evaluated with the wrong
> environment.
>
> Here's the relevant output from the tests:
> 1. Error: pool scheduler: schedules things in the right order
> (@test-scheduling.R#13)
> could not find function "task"
> 1: naiveScheduler$protect({
>        scheduleTask(1000, function() {
>            results <<- c(results, 3L)
>        })
>        scheduleTask(100, function() {
>            results <<- c(results, 2L)
>        })
>        scheduleTask(10, function() {
>            results <<- c(results, 1L)
>        })
>    }) at testthat/test-scheduling.R:13
> 2: private$refCount$release() at testthat/test-scheduling.R:13
> 3: private$callback()
> 4: self$protect({
>        task()
>    })
> 5: private$refCount$release()
> 6: private$callback()
> 7: self$protect({
>        task()
>    })
> 8: force(expr)
>
> You can see complete logs on Travis here:
>   https://travis-ci.org/rstudio/pool/jobs/217301804#L747-L771
>
> In the test, a re-entrant function is called (self$protect, at levels
> 4 and 7 on the call stack), and in the inner call (7), a variable,
> `task`, does not have the correct value. It should be a function, but
> it isn't; it's either NULL or some other non-function value.
>
> That code is here:
>   https://github.com/rstudio/pool/blob/f7e52a3/R/scheduler.R#L79-L81
> Note that there's a `force(task)` a couple lines above, which is
> commented out. If that is uncommented, the test passes, even though
> calling force() there shouldn't make a difference. Also, calling
> `format(1)` at that location also causes the test to pass.
>
>
> I've written instructions to reproduce the problem in a Docker container
> here:
>   https://gist.github.com/wch/4cd0c56757886c63522049a32b3c294f
>
> If you want to try reproducing the error without using Docker, I
> believe if you download pool_0.1.0.tar.gz file from the
> gist.github.com URL above, and run R CMD check on it, that will do it.
> Some upstream dependencies will likely need to be installed; they are
> listed in the code in the gist.
>
>
> We also used git bisect to try to find the first commit in R where the
> test would fail. The instructions for doing this are also in the gist
> linked to above. It found that the first commit where the test failed
> was 72128:
>   https://github.com/wch/r-source/commit/7c950ac11202cabc25b276c2f4da82
> 226644953d
>
> Taking a look at the changes in that commit, however, it seems very
> unlikely that it caused the error. Instead, it probably just made a
> change which allowed an already-existing bug to manifest.
>
>
> We'd appreciate any help finding the cause of this bug!
> -Winston
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Loading...