SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Henrik Bengtsson-5
On Fri, Jan 10, 2020 at 11:23 AM Simon Urbanek
<[hidden email]> wrote:
>
> Henrik,
>
> the example from the post works just fine in CRAN R for me - the post was about homebrew build so it's conceivably a bug in their libraries.

Thanks for ruling that example out.

> That's exactly why I was proposing a more general solution where you can simply define a function in user-space that will issue a warning or stop on fork, it doesn't have to be part of core R, there are other packages that use fork() as well, so what I proposed is much safer than hacking the parallel package.

I think this is worth pursuing and will help improve and stabilize
things.  But issuing a warning or an error on fork still would not let
end users run the pipeline, or am I missing something?

I'm trying to argue that this is still a real problem that users and
developers run into on a regular basis.  Since parallel::mclapply() is
such a common and readily available solution, it is also low-hanging
fruit to make it possible for those forking functions to fall back to
sequential processing.  The only(*) way to achieve this fallback right
now is to run the same pipeline on MS Windows - I just think it would
be very useful to have the same fallback option available on Unix and
macOS.  Having this in base R could also serve as a standard for other
parallel/forking packages/implementations that also wish to fall back
to sequential processing.
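
As a minimal sketch, assuming a 'fork.allowed' option like the one
proposed below, such a fallback can already be emulated at the call
site (the helper name is hypothetical; nothing like it exists in base
R):

lapply_maybe_forked <- function(X, FUN, ..., mc.cores = 2L) {
    ## Fall back to sequential lapply() unless forking is both allowed
    ## and available (i.e. not on Windows).
    use_fork <- isTRUE(getOption("fork.allowed", TRUE)) &&
        .Platform$OS.type == "unix"
    if (use_fork)
        parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
    else
        lapply(X, FUN, ...)
}

## options(fork.allowed = FALSE) then makes the call run sequentially.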

==> What would the disadvantages be of providing a mechanism/setting
for disabling forking in the parallel::mc*** API? <==

(*) One can also somewhat disable forking in 'parallel' by using
'cgroups' to limit the process to a single core (see also
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17641).  That will
handle code that uses mc.cores = parallel::detectCores(), of which
there is a lot.  I guess it will cause a run-time error ("'mc.cores'
must be >= 1") for code that uses the second most common pattern,
mc.cores = parallel::detectCores() - 1.  I find the use of hardcoded
detectCores() unfortunate, but that is a slightly different topic.
OTOH, if there were a standardized option in R for disabling all types
of parallel processing by forcing a single core, one could imagine
other parallel APIs implementing fallbacks to sequential processing as
well.  (I'm aware that not all use cases of async processing are about
parallelization, so it might not apply everywhere.)
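
To spell out that failure mode, here is a small sketch, assuming the
process has been restricted so that parallel::detectCores() reports a
single core:

## Hard-coded pattern: detectCores() - 1 evaluates to 0 here, so a
## subsequent mclapply(..., mc.cores = cores) stops with the error
## "'mc.cores' must be >= 1".
cores <- parallel::detectCores() - 1

## A defensive variant that still degrades to a single core:
cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)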

Cheers,

Henrik

>
> Cheers,
> Simon
>
>
>
> > On Jan 10, 2020, at 10:58 AM, Henrik Bengtsson <[hidden email]> wrote:
> >
> > The RStudio GUI was just one example.  AFAIK, and please correct me if
> > I'm wrong, another example is where multi-threaded code is used in
> > forked processing and that's sometimes unstable.  Yet another, which
> > might be multi-thread related or not, is
> > https://stat.ethz.ch/pipermail/r-devel/2018-September/076845.html:
> >
> > res <- parallel::mclapply(urls, function(url) {
> >  download.file(url, basename(url))
> > })
> >
> > That was reported to fail on macOS with the default method="libcurl"
> > but not for method="curl" or method="wget".
> >
> > Further documentation is needed and would help, but I don't believe
> > it's sufficient to solve everyday problems.  The argument for
> > introducing an option/env var to disable forking is to give the end
> > user a quick workaround for newly introduced bugs.  Neither the
> > developer nor the end user has full control of the R package stack,
> > which is always in flux.  For instance, the above mclapply() code
> > might have been in a package on CRAN and then all of a sudden
> > method="libcurl" became the new default in base R.  The above
> > mclapply() code is now buggy on macOS, and not necessarily caught by
> > CRAN checks.  The package developer might not notice this because
> > they are on Linux or Windows.  It can take a very long time before
> > this problem is even noticed and even longer before it is tracked
> > down and fixed.  Similarly, as more and more code turns to native
> > code and it becomes easier and easier to implement multi-threading,
> > more and more of these bugs across package dependencies risk
> > sneaking in through the backdoor wherever forked processing is in
> > place.
> >
> > For the end user, but also for upstream package developers, the
> > quickest workaround would be to disable forking.  If you're
> > conservative, you could even disable it for all of your R processing.
> > Being able to quickly disable forking also provides a mechanism for
> > testing the hypothesis that forking is the underlying problem, i.e.
> > "Please retry with options(fork.allowed = FALSE)" will come in handy
> > for troubleshooting.
> >
> > /Henrik
> >
> > On Fri, Jan 10, 2020 at 5:31 AM Simon Urbanek
> > <[hidden email]> wrote:
> >>
> >> If I understand the thread correctly this is an RStudio issue and I would suggest that the developers consider using pthread_atfork() so RStudio can handle forking as they deem fit (bail out with an error or make RStudio work).  Note that in principle the functionality requested here can be easily implemented in a package so R doesn’t need to be modified.
> >>
> >> Cheers,
> >> Simon
> >>
> >> Sent from my iPhone
> >>
> >>>> On Jan 10, 2020, at 04:34, Tomas Kalibera <[hidden email]> wrote:
> >>>>
> >>>> On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
> >>>> I'd like to pick up this thread started on 2019-04-11
> >>>> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
> >>>> Modulo all the other suggestions in this thread, would my proposal of
> >>>> being able to disable forked processing via an option or an
> >>>> environment variable make sense?
> >>>
> >>> I don't think R should be doing that. There are caveats with using fork, and they are mentioned in the documentation of the parallel package, so people can easily avoid functions that use it, and this all has been discussed here recently.
> >>>
> >>> If that is the case, we can expand the documentation in the parallel package and add a warning against the use of forking with RStudio, but for that it would be good to know at least why it is not working. From the github issue I have the impression that it is not really known why, whether it could be fixed, and if so, where. The same github issue also reflects that some people want to use forking for performance reasons, even with RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is just some race condition somewhere?
> >>>
> >>> Tomas
> >>>
> >>>> I've prototyped a working patch that
> >>>> works like:
> >>>>> options(fork.allowed = FALSE)
> >>>>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
> >>>> [1] 14058 14058
> >>>>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
> >>>> [1] 14058 14058
> >>>>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
> >>>> [1] 14058.1 14058.2
> >>>>> f <- parallel::mcparallel(Sys.getpid())
> >>>> Error in allowFork(assert = TRUE) :
> >>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>> environment variable ‘R_FORK_ALLOWED’
> >>>>> cl <- parallel::makeForkCluster(1L)
> >>>> Error in allowFork(assert = TRUE) :
> >>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>> environment variable ‘R_FORK_ALLOWED’
> >>>> The patch is:
> >>>> Index: src/library/parallel/R/unix/forkCluster.R
> >>>> ===================================================================
> >>>> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
> >>>> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
> >>>> @@ -30,6 +30,7 @@
> >>>> newForkNode <- function(..., options = defaultClusterOptions, rank)
> >>>> {
> >>>> +    allowFork(assert = TRUE)
> >>>>    options <- addClusterOptions(options, list(...))
> >>>>    outfile <- getClusterOption("outfile", options)
> >>>>    port <- getClusterOption("port", options)
> >>>> Index: src/library/parallel/R/unix/mclapply.R
> >>>> ===================================================================
> >>>> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
> >>>> +++ src/library/parallel/R/unix/mclapply.R (working copy)
> >>>> @@ -28,7 +28,7 @@
> >>>>        stop("'mc.cores' must be >= 1")
> >>>>    .check_ncores(cores)
> >>>> -    if (isChild() && !isTRUE(mc.allow.recursive))
> >>>> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
> >>>>        return(lapply(X = X, FUN = FUN, ...))
> >>>>    ## Follow lapply
> >>>> Index: src/library/parallel/R/unix/mcparallel.R
> >>>> ===================================================================
> >>>> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
> >>>> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
> >>>> @@ -20,6 +20,7 @@
> >>>> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
> >>>> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
> >>>> {
> >>>> +    allowFork(assert = TRUE)
> >>>>    f <- mcfork(detached)
> >>>>    env <- parent.frame()
> >>>>    if (isTRUE(mc.set.seed)) mc.advance.stream()
> >>>> Index: src/library/parallel/R/unix/pvec.R
> >>>> ===================================================================
> >>>> --- src/library/parallel/R/unix/pvec.R (revision 77648)
> >>>> +++ src/library/parallel/R/unix/pvec.R (working copy)
> >>>> @@ -25,7 +25,7 @@
> >>>>    cores <- as.integer(mc.cores)
> >>>>    if(cores < 1L) stop("'mc.cores' must be >= 1")
> >>>> -    if(cores == 1L) return(FUN(v, ...))
> >>>> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
> >>>>    .check_ncores(cores)
> >>>>    if(mc.set.seed) mc.reset.stream()
> >>>> with a new file src/library/parallel/R/unix/allowFork.R:
> >>>> allowFork <- function(assert = FALSE) {
> >>>>   value <- Sys.getenv("R_FORK_ALLOWED")
> >>>>   if (nzchar(value)) {
> >>>>       value <- switch(value,
> >>>>          "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
> >>>>          "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
> >>>>           stop(gettextf("invalid environment variable value: %s==%s",
> >>>>          "R_FORK_ALLOWED", value)))
> >>>>       value <- as.logical(value)
> >>>>   } else {
> >>>>       value <- TRUE
> >>>>   }
> >>>>   value <- getOption("fork.allowed", value)
> >>>>   if (is.na(value)) {
> >>>>       stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
> >>>>   }
> >>>>   if (assert && !value) {
> >>>>     stop(gettextf("Forked processing is not allowed per option %s or
> >>>> environment variable %s", sQuote("fork.allowed"),
> >>>> sQuote("R_FORK_ALLOWED")))
> >>>>   }
> >>>>   value
> >>>> }
> >>>> /Henrik
> >>>>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
> >>>>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
> >>>>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
> >>>>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> >>>>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >>>>>>>>> I think it's worth saying that mclapply() works as documented
> >>>>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> >>>>>>>> overcommitment, and that this means that it may work nicely or fail
> >>>>>>>> spectacularly depending on whether, e.g., you operate on a long
> >>>>>>>> vector.
> >>>>>>> R cannot possibly replicate documentation of the underlying operating
> >>>>>>> systems. It clearly says that fork() is used and readers who may not
> >>>>>>> know what fork() is need to learn it from external sources.
> >>>>>>> Copy-on-write is an elementary property of fork().
> >>>>>> Just to be precise, copy-on-write is an optimization widely deployed
> >>>>>> in most modern *nixes, particularly for the architectures in which R
> >>>>>> usually runs. But it is not an elementary property; it is not even
> >>>>>> possible without an MMU.
> >>>>> Yes, old Unix systems without virtual memory had fork eagerly copying.
> >>>>> Not relevant today, and certainly not for systems that run R, but indeed
> >>>>> people interested in OS internals can look elsewhere for more precise
> >>>>> information.
> >>>>> Tomas
> >>>

Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Gábor Csárdi
In reply to this post by Simon Urbanek
On Fri, Jan 10, 2020 at 7:23 PM Simon Urbanek
<[hidden email]> wrote:
>
> Henrik,
>
> the example from the post works just fine in CRAN R for me - the post was about homebrew build so it's conceivably a bug in their libraries. That's exactly why I was proposing a more general solution where you can simply define a function in user-space that will issue a warning or stop on fork, it doesn't have to be part of core R, there are other packages that use fork() as well, so what I proposed is much safer than hacking the parallel package.

Btw. what other packages do fork() without an exec() call?

Gabor
[...]


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
In reply to this post by Henrik Bengtsson-5
Henrik,

the whole point and only purpose of mc* functions is to fork. That's what the multicore package was about, so if you don't want to fork, don't use mc* functions - they don't have any other purpose. I really fail to see the point - if you use mc* functions you're very explicitly asking for forking - so your argument is like saying that print() should have an option to not print anything - it just makes no sense.

If you have code that is fork-incompatible, you clearly cannot use it in mcparallel - that's why there is a very explicit warning in the documentation. As I said, if you have some software that embeds R and has issues with forking, then that software should use pthread_atfork() to control the behavior.

Cheers,
Simon
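
For reference, the non-forking route within the same package is a
PSOCK cluster, which starts separate R processes instead of forking
and therefore works on Unix, macOS and Windows alike. A minimal
sketch:

library(parallel)
cl <- makeCluster(2L)                    # type = "PSOCK" is the default
res <- parLapply(cl, 1:4, function(x) Sys.getpid())
stopCluster(cl)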



[...]


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
In reply to this post by Gábor Csárdi


> On Jan 10, 2020, at 3:10 PM, Gábor Csárdi <[hidden email]> wrote:
>
> On Fri, Jan 10, 2020 at 7:23 PM Simon Urbanek
> <[hidden email]> wrote:
>>
>> Henrik,
>>
>> the example from the post works just fine in CRAN R for me - the post was about homebrew build so it's conceivably a bug in their libraries.
>
> I think it works now, because Apple switched to a different SSL
> library for libcurl. It usually crashes or fails on older macOS
> versions, with the CRAN build of R as well.
>

That is not true - Apple has not changed the SSL back-end for many years. The issue in that post is presumably in the homebrew version of SSL.

Cheers,
Simon


> It is not a bug in any library, it is just that macOS does not support
> fork() without an immediate exec().
>
> In general, any code that calls the macOS system libraries might
> crash. (Except for CoreFoundation, which seems to be fine, but AFAIR
> there is no guarantee for that, either.)
>
> You get crashes in the terminal as well, without multithreading. E.g.
> the keyring package links for the Security library on macOS, so you
> get:
>
> ❯ R --vanilla -q
>> .libPaths("~/R/3.6")
>> keyring::key_list()[1:2,]
>        service                                                  username
> 1    CommCenter                             kEntitlementsUniqueIDCacheKey
> 2           ids                                   identity-rsa-public-key
>> parallel::mclapply(1:10, function(i) keyring::key_list()[1:2,])
>
> *** caught segfault ***
> address 0x110, cause 'memory not mapped'
>
> *** caught segfault ***
> address 0x110, cause 'memory not mapped'
>
> AFAICT only Apple can do anything about this, and they won't.
>
> Gabor
[...]


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
In reply to this post by Gábor Csárdi


> On Jan 10, 2020, at 3:10 PM, Gábor Csárdi <[hidden email]> wrote:
>
> On Fri, Jan 10, 2020 at 7:23 PM Simon Urbanek
> <[hidden email]> wrote:
>>
>> Henrik,
>>
>> the example from the post works just fine in CRAN R for me - the post was about homebrew build so it's conceivably a bug in their libraries.
>
> I think it works now, because Apple switched to a different SSL
> library for libcurl. It usually crashes or fails on older macOS
> versions, with the CRAN build of R as well.
>
> It is not a bug in any library, it is just that macOS does not support
> fork() without an immediate exec().
>
> In general, any code that calls the macOS system libraries might
> crash. (Except for CoreFoundation, which seems to be fine, but AFAIR
> there is no guarantee for that, either.)
>

That is not true, either. macOS itself is fork-safe (it is POSIX-certified after all), but libraries may or may not be. The rules are pretty clear - fork() shares open descriptors and only inherits the main thread (see the POSIX documentation for pthread_atfork() - it illustrates the issues nicely). So as a user of APIs it may be your responsibility to make sure things are handled properly - again, that's what pthread_atfork() is for. Most libraries don't allow duplicated fds or have rules about thread safety, so it is your responsibility in the package to abide by those rules if you want it to function after forking. Some libraries don't allow forking at all, e.g., JVMs cannot be forked (because they are too complex to make them fork-safe).

In general, you cannot assume that (non-R) code is fork-safe unless it has been designed to be. That's why mcparallel() should only be used for pure R code (and even that with I/O limitations) and C code that is explicitly fork-safe. As I said, using mc* functions explicitly says that you are OK with forking, so if you run code that is not fork-safe it is clearly a user error.

That's exactly why we have the long "Warning" section in the documentation. If you have suggestions for its improvements, please feel free to supply patches.

Cheers,
Simon
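
A minimal sketch of the intended pattern described above, a forked
child evaluating pure R code only, with the result collected via
mccollect() (Unix/macOS):

library(parallel)
job <- mcparallel(sum(rnorm(1e6)))   # forks a child that runs pure R code
res <- mccollect(job)                # wait for the child and fetch its result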


[...]


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Gábor Csárdi
On Sat, Jan 11, 2020 at 4:14 AM Simon Urbanek
<[hidden email]> wrote:
[...]
> > In general, any code that calls the macOS system libraries might
> > crash. (Except for CoreFoundation, which seems to be fine, but AFAIR
> > there is no guarantee for that, either.)
> >
>
> That is not true, either. macOS itself is fork-safe (it is POSIX-certified after all), but libraries may or may not.

Right, so CoreFoundation should be fine. Well, as much as it can be.

[...]
>  As I said, using mc* functions explicitly says that you are ok with forking, so you if you run code that is not fork-safe it is clearly a user error.

Just to clarify, this basically means that you cannot use mc* in
portable code. Even base R functions might fail; see e.g. the previous
crashes with libcurl.

> That's exactly why we have the long "Warning" section in the documentation. If you have suggestions for its improvements, please feel free to supply patches.

One suggestion would be to export parallel::isChild(), so that we can
guard against crashes.
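
A minimal sketch of such a guard, assuming access to parallel's
(currently unexported) isChild(); reaching it via ':::' is shown only
for illustration and is not a supported interface:

assert_not_forked_child <- function(what = "this functionality") {
    ## isChild() only exists in parallel's namespace on Unix-alikes.
    in_child <- tryCatch(parallel:::isChild(), error = function(e) FALSE)
    if (isTRUE(in_child))
        stop(what, " is not fork-safe and cannot be used in a forked child")
    invisible(TRUE)
}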

Best,
Gabor

[...]
> >>>>>>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
> >>>>>> [1] 14058.1 14058.2
> >>>>>>> f <- parallel::mcparallel(Sys.getpid())
> >>>>>> Error in allowFork(assert = TRUE) :
> >>>>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>>>> environment variable ‘R_FORK_ALLOWED’
> >>>>>>> cl <- parallel::makeForkCluster(1L)
> >>>>>> Error in allowFork(assert = TRUE) :
> >>>>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>>>> environment variable ‘R_FORK_ALLOWED’
> >>>>>> The patch is:
> >>>>>> Index: src/library/parallel/R/unix/forkCluster.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
> >>>>>> @@ -30,6 +30,7 @@
> >>>>>> newForkNode <- function(..., options = defaultClusterOptions, rank)
> >>>>>> {
> >>>>>> +    allowFork(assert = TRUE)
> >>>>>>   options <- addClusterOptions(options, list(...))
> >>>>>>   outfile <- getClusterOption("outfile", options)
> >>>>>>   port <- getClusterOption("port", options)
> >>>>>> Index: src/library/parallel/R/unix/mclapply.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/mclapply.R (working copy)
> >>>>>> @@ -28,7 +28,7 @@
> >>>>>>       stop("'mc.cores' must be >= 1")
> >>>>>>   .check_ncores(cores)
> >>>>>> -    if (isChild() && !isTRUE(mc.allow.recursive))
> >>>>>> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
> >>>>>>       return(lapply(X = X, FUN = FUN, ...))
> >>>>>>   ## Follow lapply
> >>>>>> Index: src/library/parallel/R/unix/mcparallel.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
> >>>>>> @@ -20,6 +20,7 @@
> >>>>>> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
> >>>>>> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
> >>>>>> {
> >>>>>> +    allowFork(assert = TRUE)
> >>>>>>   f <- mcfork(detached)
> >>>>>>   env <- parent.frame()
> >>>>>>   if (isTRUE(mc.set.seed)) mc.advance.stream()
> >>>>>> Index: src/library/parallel/R/unix/pvec.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/pvec.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/pvec.R (working copy)
> >>>>>> @@ -25,7 +25,7 @@
> >>>>>>   cores <- as.integer(mc.cores)
> >>>>>>   if(cores < 1L) stop("'mc.cores' must be >= 1")
> >>>>>> -    if(cores == 1L) return(FUN(v, ...))
> >>>>>> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
> >>>>>>   .check_ncores(cores)
> >>>>>>   if(mc.set.seed) mc.reset.stream()
> >>>>>> with a new file src/library/parallel/R/unix/allowFork.R:
> >>>>>> allowFork <- function(assert = FALSE) {
> >>>>>>  value <- Sys.getenv("R_FORK_ALLOWED")
> >>>>>>  if (nzchar(value)) {
> >>>>>>      value <- switch(value,
> >>>>>>         "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
> >>>>>>         "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
> >>>>>>          stop(gettextf("invalid environment variable value: %s==%s",
> >>>>>>         "R_FORK_ALLOWED", value)))
> >>>>>> value <- as.logical(value)
> >>>>>>  } else {
> >>>>>>      value <- TRUE
> >>>>>>  }
> >>>>>>  value <- getOption("fork.allowed", value)
> >>>>>>  if (is.na(value)) {
> >>>>>>      stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
> >>>>>>  }
> >>>>>>  if (assert && !value) {
> >>>>>>    stop(gettextf("Forked processing is not allowed per option %s or
> >>>>>> environment variable %s", sQuote("fork.allowed"),
> >>>>>> sQuote("R_FORK_ALLOWED")))
> >>>>>>  }
> >>>>>>  value
> >>>>>> }
> >>>>>> /Henrik
> >>>>>>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
> >>>>>>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
> >>>>>>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
> >>>>>>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> >>>>>>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >>>>>>>>>>> I think it's worth saying that mclapply() works as documented
> >>>>>>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> >>>>>>>>>> overcommitment, and that this means that it may work nicely or fail
> >>>>>>>>>> spectacularly depending on whether, e.g., you operate on a long
> >>>>>>>>>> vector.
> >>>>>>>>> R cannot possibly replicate documentation of the underlying operating
> >>>>>>>>> systems. It clearly says that fork() is used and readers who may not
> >>>>>>>>> know what fork() is need to learn it from external sources.
> >>>>>>>>> Copy-on-write is an elementary property of fork().
> >>>>>>>> Just to be precise, copy-on-write is an optimization widely deployed
> >>>>>>>> in most modern *nixes, particularly for the architectures in which R
> >>>>>>>> usually runs. But it is not an elementary property; it is not even
> >>>>>>>> possible without an MMU.
> >>>>>>> Yes, old Unix systems without virtual memory had fork eagerly copying.
> >>>>>>> Not relevant today, and certainly not for systems that run R, but indeed
> >>>>>>> people interested in OS internals can look elsewhere for more precise
> >>>>>>> information.
> >>>>>>> Tomas
> >>>>>
> >>>>> ______________________________________________
> >>>>> [hidden email] mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>
> >>>> ______________________________________________
> >>>> [hidden email] mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Gábor Csárdi
In reply to this post by Simon Urbanek
On Sat, Jan 11, 2020 at 12:53 PM Simon Urbanek
<[hidden email]> wrote:
[...]
> > I think it works now, because Apple switched to a different SSL
> > library for libcurl. It usually crashes or fails on older macOS
> > versions, with the CRAN build of R as well.
> >
>
> That is not true - Apple has not changed the SSL back-end for many years. The issue in that post is presumably in the homebrew version of SSL.

This is CRAN R 3.6.2 on El Capitan crashing for HTTPS. Homebrew is not
installed on the machine.

> download.file("https://httpbin.org/status/200", tempfile(), method = "libcurl")
trying URL 'https://httpbin.org/status/200'
downloaded 0 bytes

> parallel::mclapply(1:10, function(x) download.file("https://httpbin.org/status/200", tempfile(), method = "libcurl"))
trying URL 'https://httpbin.org/status/200'
trying URL 'https://httpbin.org/status/200'

 *** caught segfault ***

address 0x110, cause 'memory not mapped'

I assume the crash is coming from the SSL library (or libraries),
because the same call works for HTTP and only crashes for HTTPS.

FWIW these are the libraries the system libcurl uses on El Capitan:

Rhubs-Mac-3:macoscheck rhub$ otool -L /usr/lib/libcurl.4.dylib
/usr/lib/libcurl.4.dylib:
     /usr/lib/libcurl.4.dylib (compatibility version 7.0.0, current
version 8.0.0)
     /System/Library/Frameworks/Security.framework/Versions/A/Security
(compatibility version 1.0.0, current version 57332.0.0)
     /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
(compatibility version 150.0.0, current version 1231.0.0)
     /System/Library/Frameworks/LDAP.framework/Versions/A/LDAP
(compatibility version 1.0.0, current version 2.4.0)
     /System/Library/Frameworks/Kerberos.framework/Versions/A/Kerberos
(compatibility version 5.0.0, current version 6.0.0)
     /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.5)
     /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1225.0.0)

and this is Mojave:

❯ otool -L /usr/lib/libcurl.4.dylib
/usr/lib/libcurl.4.dylib:
     /usr/lib/libcurl.4.dylib (compatibility version 7.0.0, current
version 9.0.0)
     /usr/lib/libcrypto.42.dylib (compatibility version 43.0.0,
current version 43.0.0)
     /usr/lib/libssl.44.dylib (compatibility version 45.0.0, current
version 45.1.0)
     /System/Library/Frameworks/LDAP.framework/Versions/A/LDAP
(compatibility version 1.0.0, current version 2.4.0)
     /System/Library/Frameworks/Kerberos.framework/Versions/A/Kerberos
(compatibility version 5.0.0, current version 6.0.0)
     /usr/lib/libapple_nghttp2.dylib (compatibility version 1.0.0,
current version 1.24.1)
     /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
     /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1252.250.1)

So the SSL libraries do look different to me, but please correct me if
I'm wrong. Relatedly, on El Capitan an HTTPS download with
download.file() starts up extra threads, whereas on Mojave it does not.
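
For what it's worth, the SSL back-end that R's libcurl uses can also be
queried from within R, which makes it easy to compare machines
(libcurlVersion() lives in the utils package in recent R versions):

## Quick check of the libcurl and SSL versions this R session is using.
v <- libcurlVersion()
v                        # libcurl version string
attr(v, "ssl_version")   # e.g. "SecureTransport" vs "LibreSSL/2.6.5"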

Gabor

> Cheers,
> Simon
[...]


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Henrik Bengtsson-5
In reply to this post by Simon Urbanek
On Fri, Jan 10, 2020 at 7:23 PM Simon Urbanek
<[hidden email]> wrote:
>
> Henrik,
>
> the whole point and only purpose of mc* functions is to fork. That's what the multicore package was about, so if you don't want to fork, don't use mc* functions - they don't have any other purpose.

But, by that same argument, I'm surprised that we have fake
implementations of the mc***() functions for MS Windows.
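
If I remember the sources correctly, the Windows stand-in is roughly the
following (paraphrasing from memory, so treat it as a sketch rather than
the exact code) - it never forks and simply runs sequentially:

## Rough paraphrase of parallel's mclapply() on Windows (not the exact
## source): no forking at all, just lapply(), with mc.cores pinned to 1.
mclapply <- function(X, FUN, ..., mc.cores = 1L) {
  cores <- as.integer(mc.cores)
  if (cores < 1L) stop("'mc.cores' must be >= 1")
  if (cores > 1L) stop("'mc.cores' > 1 is not supported on Windows")
  lapply(X = X, FUN = FUN, ...)
}

So a sequential code path already exists; the request is just to be able
to opt into it on Unix and macOS as well.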

> I really fail to see the point - if you use mc* functions you're very explicitly asking for forking - so your argument is like saying that print() should have an option to not print anything - it just makes no sense. If you have code that is fork-incompatilble, you clearly cannot use it in mcparallel - that's why there is a very explicit warning in the documentation.

I think you're casting the word "you" too widely here - what I've been
trying to argue is that users don't always have this control -
loadedNamespaces() can easily list hundreds of packages these days.
Usage of forking will come and go in that software stack, completely
outside the user's control.  They could be in control if there were an
option to disable it.

I don't think the analogy to print() is relevant here.

> As I said, if you have some software that embeds R and has issue with forks, then that software should be use pthread_atfork() to control the behavior.

If I understand this correctly, it only applies to cases such as
RStudio and similar front-ends that embed R.  It is not a relevant
solution for ordinary users of R, correct?

I'm not sure if you're playing devil's advocate here, or if I'm really
bad at explaining the problem, but here's my last attempt at an
example:

Consider PkgA::train(), which calls mclapply(..., PkgB::estimate). Life
is good. Then PkgB fixes a bug in the estimator and optimizes some of
the internals/native code - unfortunately, the latter is not meant to
work in forked parallel processing.  The author of PkgB has no idea
that this change will affect PkgA negatively. When PkgB hits CRAN, the
problem in PkgA might go undetected, e.g. because of low code coverage
in the package tests or for other reasons.  Now CRAN distributes an
unstable software stack.  All code bases relying on PkgA::train() are
now completely useless to some or all users.  Only with major
troubleshooting and responsive package maintainers can these types of
problems be resolved quickly - until then, users have to turn to
complicated workarounds such as installing old versions of PkgA and
backporting bug fixes, etc.  A lot of users don't even consider this an
option.

Usage of the 'parallel::mc***()' functions is very common; 357 packages
on CRAN call them directly, and even more do so indirectly via doMC and
doParallel.  I expect this kind of instability to show up for at least
some of these.

If the stance on this is that mclapply() and friends should only be
used when you are 100% sure that the code is fork-safe, then I'd like
to argue that those functions should never ever be allowed to call
other packages, because you are not in full control of such code.  This
should certainly be documented, and it shouldn't be too hard to add
assertions for it in 'R CMD check'.

A final plea: Adding an option to disable forking, even if only in the
'parallel' package, would spare people (end users, developers,
sysadmins, ...) many hours of troubleshooting and searching for
workarounds.  Those hours add up quickly given the number of R users we
have out there.  We have more important things to spend our time on.  I
can easily count days wasted on troubleshooting and helping others
remotely with problems related to the instability of forked processing.
Being able to disable it would have shortened this quite a bit.
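
Until something like this exists in 'parallel' itself, the best a
package or pipeline author can do is wrap the calls they own, along
these lines (a minimal user-space sketch, using the hypothetical
'fork.allowed' option / 'R_FORK_ALLOWED' variable from my patch above):

## Minimal sketch of the proposed fallback: honour a hypothetical
## 'fork.allowed' option (or R_FORK_ALLOWED environment variable) and
## fall back to sequential lapply() when forking is disabled.
mclapply_or_lapply <- function(X, FUN, ...) {
  allowed <- getOption("fork.allowed",
                       !identical(Sys.getenv("R_FORK_ALLOWED"), "FALSE"))
  if (isTRUE(allowed) && .Platform$OS.type == "unix")
    parallel::mclapply(X, FUN, ...)
  else
    lapply(X, FUN, ...)
}

But that only covers calls you control yourself, not mclapply() calls
buried deep down in the dependency stack - which is the whole point of
asking for this in 'parallel'.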

/Henrik

>
> Cheers,
> Simon
>
>
>
> > On Jan 10, 2020, at 3:34 PM, Henrik Bengtsson <[hidden email]> wrote:
> >
> > On Fri, Jan 10, 2020 at 11:23 AM Simon Urbanek
> > <[hidden email]> wrote:
> >>
> >> Henrik,
> >>
> >> the example from the post works just fine in CRAN R for me - the post was about homebrew build so it's conceivably a bug in their libraries.
> >
> > Thanks for ruling that example out.
> >
> >> That's exactly why I was proposing a more general solution where you can simply define a function in user-space that will issue a warning or stop on fork, it doesn't have to be part of core R, there are other packages that use fork() as well, so what I proposed is much safer than hacking the parallel package.
> >
> > I think this is worth pursuing and will help improve and stabilize
> > things.  But issuing a warning or stop on fork will not allow end
> > users from running the pipeline, or am I missing something?
> >
> > I'm trying to argue that this is still a real problem that users and
> > developers run into on a regular basis.  Since parallel::mclapply() is
> > such a common and readily available solution it is also a low hanging
> > fruit to make it possible to have those forking functions fall back to
> > sequential processing.  The only(*) way to achieve this fall back
> > right now is to run the same pipeline on MS Windows - I just think it
> > would be very useful to have the same fallback option available on
> > Unix and macOS.  Having this in base R could also serve as standard
> > for other parallel/forking packages/implementations who also wish to
> > have a fallback to sequential processing.
> >
> > ==> What would the disadvantages be to provide a mechanism/setting for
> > disabling forking in the parallel::mc*** API? <==
> >
> > (*) One can also somewhat disable forking in 'parallel' by using
> > 'cgroups' limiting the process to a single core (see also
> > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17641).  That will
> > handle code that uses mc.cores = parallel::detectCores(), which there
> > is a lot of.  I guess it will cause run-time error (on 'mc.cores' must
> > be >= 1) for code that uses the second most common used mc.cores =
> > parallel::detectCores() - 1, which is unfortunately also very common.
> > I find the use of hardcoded detectCores() unfortunate but that is a
> > slightly different topic.  OTH, if there would a standardized option
> > in R for disabling all types of parallel processing by forcing a
> > single core, one could imagine other parallel APIs to implement
> > fallbacks to sequential processing as well. (I'm aware that not all
> > use cases of async processing is about parallelization, so it might
> > not apply everywhere).
> >
> > Cheers,
> >
> > Henrik
> >
> >>
> >> Cheers,
> >> Simon
> >>
> >>
> >>
> >>> On Jan 10, 2020, at 10:58 AM, Henrik Bengtsson <[hidden email]> wrote:
> >>>
> >>> The RStudio GUI was just one example.  AFAIK, and please correct me if
> >>> I'm wrong, another example is where multi-threaded code is used in
> >>> forked processing and that's sometimes unstable.  Yes another, which
> >>> might be multi-thread related or not, is
> >>> https://stat.ethz.ch/pipermail/r-devel/2018-September/076845.html:
> >>>
> >>> res <- parallel::mclapply(urls, function(url) {
> >>> download.file(url, basename(url))
> >>> })
> >>>
> >>> That was reported to fail on macOS with the default method="libcurl"
> >>> but not for method="curl" or method="wget".
> >>>
> >>> Further documentation is needed and would help but I don't believe
> >>> it's sufficient to solve everyday problems.  The argument for
> >>> introducing an option/env var to disable forking is to give the end
> >>> user a quick workaround for newly introduced bugs.  Neither the
> >>> develop nor the end user have full control of the R package stack,
> >>> which is always in flux.  For instance, above mclapply() code might
> >>> have been in a package on CRAN and then all of a sudden
> >>> method="libcurl" became the new default in base R.  The above
> >>> mclapply() code is now buggy on macOS, and not necessarily caught by
> >>> CRAN checks.  The package developer might not notice this because they
> >>> are on Linux or Windows.  It can take a very long time before this
> >>> problem is even noticed and even further before it is tracked down and
> >>> fixed.   Similarly, as more and more code turn to native code and it
> >>> becomes easier and easier to implement multi-threading, more and more
> >>> of these bugs across package dependencies risk sneaking in the
> >>> backdoor wherever forked processing is in place.
> >>>
> >>> For the end user, but also higher-up upstream package developers, the
> >>> quickest workaround would be disable forking.  If you're conservative,
> >>> you could even disable it all of your R processing.  Being able to
> >>> quickly disable forking will also provide a mechanism for quickly
> >>> testing the hypothesis that forking is the underlying problem, i.e.
> >>> "Please retry with options(fork.allowed = FALSE)" will become handy
> >>> for troubleshooting.
> >>>
> >>> /Henrik
> >>>
> >>> On Fri, Jan 10, 2020 at 5:31 AM Simon Urbanek
> >>> <[hidden email]> wrote:
> >>>>
> >>>> If I understand the thread correctly this is an RStudio issue and I would suggest that the developers consider using pthread_atfork() so RStudio can handle forking as they deem fit (bail out with an error or make RStudio work).  Note that in principle the functionality requested here can be easily implemented in a package so R doesn’t need to be modified.
> >>>>
> >>>> Cheers,
> >>>> Simon
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>>>> On Jan 10, 2020, at 04:34, Tomas Kalibera <[hidden email]> wrote:
> >>>>>>
> >>>>>> On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
> >>>>>> I'd like to pick up this thread started on 2019-04-11
> >>>>>> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
> >>>>>> Modulo all the other suggestions in this thread, would my proposal of
> >>>>>> being able to disable forked processing via an option or an
> >>>>>> environment variable make sense?
> >>>>>
> >>>>> I don't think R should be doing that. There are caveats with using fork, and they are mentioned in the documentation of the parallel package, so people can easily avoid functions that use it, and this all has been discussed here recently.
> >>>>>
> >>>>> If it is the case, we can expand the documentation in parallel package, add a warning against the use of forking with RStudio, but for that I it would be good to know at least why it is not working. From the github issue I have the impression that it is not really known why, whether it could be fixed, and if so, where. The same github issue reflects also that some people want to use forking for performance reasons, and even with RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is just some race condition somewhere?
> >>>>>
> >>>>> Tomas
> >>>>>
> >>>>>> I've prototyped a working patch that
> >>>>>> works like:
> >>>>>>> options(fork.allowed = FALSE)
> >>>>>>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
> >>>>>> [1] 14058 14058
> >>>>>>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
> >>>>>> [1] 14058 14058
> >>>>>>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
> >>>>>> [1] 14058.1 14058.2
> >>>>>>> f <- parallel::mcparallel(Sys.getpid())
> >>>>>> Error in allowFork(assert = TRUE) :
> >>>>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>>>> environment variable ‘R_FORK_ALLOWED’
> >>>>>>> cl <- parallel::makeForkCluster(1L)
> >>>>>> Error in allowFork(assert = TRUE) :
> >>>>>> Forked processing is not allowed per option ‘fork.allowed’ or
> >>>>>> environment variable ‘R_FORK_ALLOWED’
> >>>>>> The patch is:
> >>>>>> Index: src/library/parallel/R/unix/forkCluster.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
> >>>>>> @@ -30,6 +30,7 @@
> >>>>>> newForkNode <- function(..., options = defaultClusterOptions, rank)
> >>>>>> {
> >>>>>> +    allowFork(assert = TRUE)
> >>>>>>   options <- addClusterOptions(options, list(...))
> >>>>>>   outfile <- getClusterOption("outfile", options)
> >>>>>>   port <- getClusterOption("port", options)
> >>>>>> Index: src/library/parallel/R/unix/mclapply.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/mclapply.R (working copy)
> >>>>>> @@ -28,7 +28,7 @@
> >>>>>>       stop("'mc.cores' must be >= 1")
> >>>>>>   .check_ncores(cores)
> >>>>>> -    if (isChild() && !isTRUE(mc.allow.recursive))
> >>>>>> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
> >>>>>>       return(lapply(X = X, FUN = FUN, ...))
> >>>>>>   ## Follow lapply
> >>>>>> Index: src/library/parallel/R/unix/mcparallel.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
> >>>>>> @@ -20,6 +20,7 @@
> >>>>>> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
> >>>>>> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
> >>>>>> {
> >>>>>> +    allowFork(assert = TRUE)
> >>>>>>   f <- mcfork(detached)
> >>>>>>   env <- parent.frame()
> >>>>>>   if (isTRUE(mc.set.seed)) mc.advance.stream()
> >>>>>> Index: src/library/parallel/R/unix/pvec.R
> >>>>>> ===================================================================
> >>>>>> --- src/library/parallel/R/unix/pvec.R (revision 77648)
> >>>>>> +++ src/library/parallel/R/unix/pvec.R (working copy)
> >>>>>> @@ -25,7 +25,7 @@
> >>>>>>   cores <- as.integer(mc.cores)
> >>>>>>   if(cores < 1L) stop("'mc.cores' must be >= 1")
> >>>>>> -    if(cores == 1L) return(FUN(v, ...))
> >>>>>> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
> >>>>>>   .check_ncores(cores)
> >>>>>>   if(mc.set.seed) mc.reset.stream()
> >>>>>> with a new file src/library/parallel/R/unix/allowFork.R:
> >>>>>> allowFork <- function(assert = FALSE) {
> >>>>>>  value <- Sys.getenv("R_FORK_ALLOWED")
> >>>>>>  if (nzchar(value)) {
> >>>>>>      value <- switch(value,
> >>>>>>         "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
> >>>>>>         "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
> >>>>>>          stop(gettextf("invalid environment variable value: %s==%s",
> >>>>>>         "R_FORK_ALLOWED", value)))
> >>>>>> value <- as.logical(value)
> >>>>>>  } else {
> >>>>>>      value <- TRUE
> >>>>>>  }
> >>>>>>  value <- getOption("fork.allowed", value)
> >>>>>>  if (is.na(value)) {
> >>>>>>      stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
> >>>>>>  }
> >>>>>>  if (assert && !value) {
> >>>>>>    stop(gettextf("Forked processing is not allowed per option %s or
> >>>>>> environment variable %s", sQuote("fork.allowed"),
> >>>>>> sQuote("R_FORK_ALLOWED")))
> >>>>>>  }
> >>>>>>  value
> >>>>>> }
> >>>>>> /Henrik
> >>>>>>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
> >>>>>>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
> >>>>>>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
> >>>>>>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> >>>>>>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >>>>>>>>>>> I think it's worth saying that mclapply() works as documented
> >>>>>>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> >>>>>>>>>> overcommitment, and that this means that it may work nicely or fail
> >>>>>>>>>> spectacularly depending on whether, e.g., you operate on a long
> >>>>>>>>>> vector.
> >>>>>>>>> R cannot possibly replicate documentation of the underlying operating
> >>>>>>>>> systems. It clearly says that fork() is used and readers who may not
> >>>>>>>>> know what fork() is need to learn it from external sources.
> >>>>>>>>> Copy-on-write is an elementary property of fork().
> >>>>>>>> Just to be precise, copy-on-write is an optimization widely deployed
> >>>>>>>> in most modern *nixes, particularly for the architectures in which R
> >>>>>>>> usually runs. But it is not an elementary property; it is not even
> >>>>>>>> possible without an MMU.
> >>>>>>> Yes, old Unix systems without virtual memory had fork eagerly copying.
> >>>>>>> Not relevant today, and certainly not for systems that run R, but indeed
> >>>>>>> people interested in OS internals can look elsewhere for more precise
> >>>>>>> information.
> >>>>>>> Tomas
> >>>>>
> >>>>> ______________________________________________
> >>>>> [hidden email] mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>
> >>>> ______________________________________________
> >>>> [hidden email] mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Sun, 12 Jan 2020 at 00:49, Henrik Bengtsson
<[hidden email]> wrote:

>
> [snip]
>
> A final plead: Adding an option to disable forking, at least in the
> 'parallel' package only, will spare people (end users, developers,
> sysadms, ...) many many hours of troubleshooting and eventually trying
> to find workarounds. Those hours adds up quickly given the number of R
> users we have out there.  We have more important things to spend our
> time on.  I can easily count days wasted due to troubleshooting and
> helping others remotely on problems related to instability of forked
> processing. Being able to disable it, would have shortcut this quite a
> bit.

+1 to such an option. I don't see how this could be implemented in
another package. One could do something like

stop_on_fork <- inline::cfunction(
  body='pthread_atfork(stop, NULL, NULL);',
  includes='#include <pthread.h>', convention=".C",
  otherdefs='void stop() { Rf_error("Fork disabled"); }')
stop_on_fork()
parallel::mclapply(1:2, force)

which works nicely in a standalone R session, but freezes RStudio.
Another workaround would be

unlockBinding("mclapply", getNamespace("parallel"))
assignInNamespace("mclapply", function(...) stop("Fork disabled"), "parallel")
parallel::mclapply(1:2, force)

(plus several more bindings to cover all the cases), but that's not
allowed, and shouldn't be allowed, on CRAN.
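
A softer, and only partial, mitigation is to force a single core via the
mc.cores option (or the MC_CORES environment variable, which is read
when the parallel package is loaded); that tames calls relying on the
default, but not calls that hardcode mc.cores or use detectCores():

## Partial mitigation only: affects calls that use the default
## mc.cores = getOption("mc.cores", 2L); calls with a hardcoded
## mc.cores, or with mc.cores = parallel::detectCores(), are unaffected.
options(mc.cores = 1L)
parallel::mclapply(1:2, function(i) Sys.getpid())  # now runs sequentially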

Iñaki
