SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

29 messages
SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Henrik Bengtsson
ISSUE:
Using *forks* for parallel processing in R is not always safe.  The
`parallel::mclapply()` function uses forked processes to parallelize.
One example where it has been confirmed that forked processing causes
problems is when running R via RStudio.  It is recommended to use
PSOCK clusters (`parallel::makeCluster()`) rather than *forked*
processes when running R from RStudio (
https://github.com/rstudio/rstudio/issues/2597#issuecomment-482187011).
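For reference, the PSOCK route recommended there looks roughly like this (worker count and payload are arbitrary illustrations, not from the linked issue):

```r
library(parallel)

## PSOCK workers are fresh R processes (no fork), so this is safe in
## front-ends like RStudio; the trade-off is that globals must be
## exported to the workers explicitly.
cl <- makeCluster(2L, type = "PSOCK")
x <- 1:10
clusterExport(cl, "x")
res <- parLapply(cl, 1:2, function(i) sum(x) + i)
stopCluster(cl)
```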

AFAIK, it is not straightforward to disable forked processing in R.

One could set environment variable `MC_CORES=1` which will set R
option `mc.cores=1` when the parallel package is loaded.  Since
`mc.cores = getOption("mc.cores", 2L)` is the default for
`parallel::mclapply()`, this will cause `mclapply()` to fall back to
`lapply()` avoiding _forked_ processing.  However, this does not work
when the code specifies argument `mc.cores`, e.g. `mclapply(...,
mc.cores = detectCores())`.
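To spell out the workaround and its limitation (a sketch, assuming `MC_CORES=1` was set in the environment before R, and hence the parallel package, started):

```r
library(parallel)

## With MC_CORES=1 set, getOption("mc.cores") is 1, so the
## default-argument case degenerates to lapply() and no fork happens.
res_default <- mclapply(1:2, function(i) Sys.getpid())

## ...but an explicit mc.cores argument bypasses the option entirely,
## so this still forks.
res_explicit <- mclapply(1:2, function(i) Sys.getpid(), mc.cores = detectCores())
```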


SUGGESTION:
Introduce environment variable `R_ENABLE_FORKS` and corresponding R
option `enable.forks` that both take logical scalars.  By setting
`R_ENABLE_FORKS=false` or equivalently `enable.forks=FALSE`,
`parallel::mclapply()` will fall back to `lapply()`.

For `parallel::mcparallel()`, we could produce an error if forks are disabled.


Comments?

/Henrik

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
<[hidden email]> wrote:
>
> ISSUE:
> Using *forks* for parallel processing in R is not always safe.
> [...]
> Comments?

Using fork() is never safe. The reference provided by Kevin [1] is
pretty compelling (I kindly encourage anyone who ever forked a process
to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd
advocate for deprecating fork clusters and eventually removing them
from parallel.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf

--
Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Travers Ching
Just throwing my two cents in:

I think removing/deprecating fork would be a bad idea for two reasons:

1) There are no performant alternatives
2) Removing fork would break existing workflows

Even if replaced with something using the same interface (e.g., a
function that automatically detects variables to export as in the
amazing `future` package), the lack of copy-on-write functionality
would cause scripts everywhere to break.

A simple example illustrating these two points:
`x <- 5e8; mclapply(1:24, sum, x, 8)`

Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
does not complete.
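Travers's one-liner, unpacked into the two set-ups being compared (rewritten with a closure for clarity; whether the PSOCK case completes depends on available RAM, so treat this as a sketch rather than a benchmark):

```r
library(parallel)
x <- rnorm(5e8)   # ~4 GB of doubles in the master process

## Fork: children read x through copy-on-write pages, nothing is copied.
system.time(mclapply(1:24, function(i) sum(x), mc.cores = 8))

## PSOCK: x must be serialized to every worker before any work starts.
cl <- makeCluster(8L, type = "PSOCK")
clusterExport(cl, "x")                      # the expensive step
system.time(clusterApply(cl, 1:24, function(i) sum(x)))
stopCluster(cl)
```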

Travers

On Fri, Apr 12, 2019 at 2:32 AM Iñaki Ucar <[hidden email]> wrote:

>
> On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
> <[hidden email]> wrote:
> >
> > ISSUE:
> > Using *forks* for parallel processing in R is not always safe.
> > [...]
> > Comments?
>
> Using fork() is never safe. The reference provided by Kevin [1] is
> pretty compelling (I kindly encourage anyone who ever forked a process
> to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd
> advocate for deprecating fork clusters and eventually removing them
> from parallel.
>
> [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
>
> --
> Iñaki Úcar
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Fri, 12 Apr 2019 at 21:32, Travers Ching <[hidden email]> wrote:
>
> Just throwing my two cents in:
>
> I think removing/deprecating fork would be a bad idea for two reasons:
>
> 1) There are no performant alternatives

"Performant"... in terms of what. If the cost of copying the data
predominates over the computation time, maybe you didn't need
parallelization in the first place.

> 2) Removing fork would break existing workflows

I don't see why mclapply could not be rewritten using PSOCK clusters.
And as a side effect, this would enable those workflows on Windows,
which doesn't support fork.

> Even if replaced with something using the same interface (e.g., a
> function that automatically detects variables to export as in the
> amazing `future` package), the lack of copy-on-write functionality
> would cause scripts everywhere to break.

To implement copy-on-write, Linux overcommits virtual memory, and this
is what causes scripts to break unexpectedly: everything works fine,
until you change a small unimportant bit and... boom, out of memory.
And in general, running forks in any GUI would cause things everywhere
to break.

> A simple example illustrating these two points:
> `x <- 5e8; mclapply(1:24, sum, x, 8)`
>
> Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> does not complete.

I'm not sure how you set that up, but it does complete. Or do you
mean that you ran out of memory? Then try replacing "x" with, e.g.,
"x+1" in your mclapply example and see what happens (hint: save your
work first).
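The "small unimportant bit" failure mode is easy to reproduce (sizes illustrative; the second call is the one that can exhaust memory, so the hint about saving your work applies):

```r
library(parallel)
x <- rnorm(5e8)   # ~4 GB

## Children only read x: pages stay shared, total usage stays ~4 GB.
res <- mclapply(1:8, function(i) sum(x), mc.cores = 8)

## Children now evaluate x + 1: each materializes its own ~4 GB copy,
## and an overcommitted kernel only discovers the problem at write time.
res <- mclapply(1:8, function(i) sum(x + 1), mc.cores = 8)
```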

--
Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Travers Ching
Hi Iñaki,

> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.

Performant in terms of speed.  There's no copying in that example
using `mclapply` and so it is significantly faster than other
alternatives.

It is a very simple and contrived example, but there are lots of
applications that depend on processing large data and benefit from
multithreading: for example, reading in large sequencing data with
`Rsamtools` and checking the sequences for a set of motifs.

> I don't see why mclapply could not be rewritten using PSOCK clusters.

Because it would be much slower.

> To implement copy-on-write, Linux overcommits virtual memory, and this
>  is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.

> I'm not sure how did you setup that, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).

Yes, I meant that it ran out of memory on my desktop.  I understand
the limits, and it is not perfect because of the GUI issue you
mention, but I don't see a better alternative in terms of speed.

Regards,
Travers




On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar <[hidden email]> wrote:

>
> On Fri, 12 Apr 2019 at 21:32, Travers Ching <[hidden email]> wrote:
> >
> > Just throwing my two cents in:
> >
> > I think removing/deprecating fork would be a bad idea for two reasons:
> >
> > 1) There are no performant alternatives
>
> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.
>
> > 2) Removing fork would break existing workflows
>
> I don't see why mclapply could not be rewritten using PSOCK clusters.
> And as a side effect, this would enable those workflows on Windows,
> which doesn't support fork.
>
> > Even if replaced with something using the same interface (e.g., a
> > function that automatically detects variables to export as in the
> > amazing `future` package), the lack of copy-on-write functionality
> > would cause scripts everywhere to break.
>
> To implement copy-on-write, Linux overcommits virtual memory, and this
> is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.
>
> > A simple example illustrating these two points:
> > `x <- 5e8; mclapply(1:24, sum, x, 8)`
> >
> > Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> > does not complete.
>
> I'm not sure how did you setup that, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).
>
> --
> Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Kevin Ushey
I think it's worth saying that mclapply() works as documented: it
relies on forking, and so doesn't work well in environments where it's
unsafe to fork. This is spelled out explicitly in the documentation of
?mclapply:

It is strongly discouraged to use these functions in GUI or embedded
environments, because it leads to several processes sharing the same
GUI which will likely cause chaos (and possibly crashes). Child
processes should never use on-screen graphics devices.

I believe the expectation is that users who need more control over the
kind of cluster that's used for parallel computations would instead
create the cluster themselves with e.g. `makeCluster()` and then use
`clusterApply()` / `parLapply()` or other APIs as appropriate.

In environments where forking works, `mclapply()` is nice because you
don't need to think -- the process is forked, and anything available
in your main session is automatically available in the child
processes. This is a nice convenience for when you know it's safe to
fork R (and know what you're doing is safe to do within a forked
process). When it's not safe, it's better to prefer the other APIs
available for computation on a cluster.
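The convenience Kevin describes boils down to this contrast (a sketch; cluster size arbitrary):

```r
library(parallel)
x <- 42

## Forked children inherit the session: x is just there.
res_fork <- mclapply(1:2, function(i) x + i)

## PSOCK workers start empty: omit the clusterExport() and the worker
## fails with "object 'x' not found".
cl <- makeCluster(2L, type = "PSOCK")
clusterExport(cl, "x")
res_psock <- parLapply(cl, 1:2, function(i) x + i)
stopCluster(cl)
```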

Forking can be unsafe and dangerous, but it's also convenient and
sometimes that convenience can outweigh the other concerns.

Finally, I want to add: the onus should be on the front-end to work
well with R, and not the other way around. I don't think it's fair to
impose extra work / an extra maintenance burden on the R Core team for
something that's already clearly documented ...

Best,
Kevin


On Fri, Apr 12, 2019 at 6:04 PM Travers Ching <[hidden email]> wrote:

>
> Hi Inaki,
>
> > "Performant"... in terms of what. If the cost of copying the data
> > predominates over the computation time, maybe you didn't need
> > parallelization in the first place.
>
> Performant in terms of speed.  There's no copying in that example
> using `mclapply` and so it is significantly faster than other
> alternatives.
>
> It is a very simple and contrived example, but there are lots of
> applications that depend on processing of large data and benefit from
> multithreading.  For example, if I read in large sequencing data with
> `Rsamtools` and want to check sequences for a set of motifs.
>
> > I don't see why mclapply could not be rewritten using PSOCK clusters.
>
> Because it would be much slower.
>
> > To implement copy-on-write, Linux overcommits virtual memory, and this
> >  is what causes scripts to break unexpectedly: everything works fine,
> > until you change a small unimportant bit and... boom, out of memory.
> > And in general, running forks in any GUI would cause things everywhere
> > to break.
>
> > I'm not sure how did you setup that, but it does complete. Or do you
> > mean that you ran out of memory? Then try replacing "x" with, e.g.,
> > "x+1" in your mclapply example and see what happens (hint: save your
> > work first).
>
> Yes, I meant that it ran out of memory on my desktop.  I understand
> the limits, and it is not perfect because of the GUI issue you
> mention, but I don't see a better alternative in terms of speed.
>
> Regards,
> Travers
>
>
>
>
> On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar <[hidden email]> wrote:
> >
> > On Fri, 12 Apr 2019 at 21:32, Travers Ching <[hidden email]> wrote:
> > >
> > > Just throwing my two cents in:
> > >
> > > I think removing/deprecating fork would be a bad idea for two reasons:
> > >
> > > 1) There are no performant alternatives
> >
> > "Performant"... in terms of what. If the cost of copying the data
> > predominates over the computation time, maybe you didn't need
> > parallelization in the first place.
> >
> > > 2) Removing fork would break existing workflows
> >
> > I don't see why mclapply could not be rewritten using PSOCK clusters.
> > And as a side effect, this would enable those workflows on Windows,
> > which doesn't support fork.
> >
> > > Even if replaced with something using the same interface (e.g., a
> > > function that automatically detects variables to export as in the
> > > amazing `future` package), the lack of copy-on-write functionality
> > > would cause scripts everywhere to break.
> >
> > To implement copy-on-write, Linux overcommits virtual memory, and this
> > is what causes scripts to break unexpectedly: everything works fine,
> > until you change a small unimportant bit and... boom, out of memory.
> > And in general, running forks in any GUI would cause things everywhere
> > to break.
> >
> > > A simple example illustrating these two points:
> > > `x <- 5e8; mclapply(1:24, sum, x, 8)`
> > >
> > > Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> > > does not complete.
> >
> > I'm not sure how did you setup that, but it does complete. Or do you
> > mean that you ran out of memory? Then try replacing "x" with, e.g.,
> > "x+1" in your mclapply example and see what happens (hint: save your
> > work first).
> >
> > --
> > Iñaki Úcar
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
I fully agree with Kevin. Front-ends can always use pthread_atfork() to close descriptors and suspend threads in children.

Anyone who thinks you can use PSOCK clusters has obviously not used mclapply() in real applications - trying to save the workspace and restore it in 20 new processes is not only incredibly wasteful (no shared memory whatsoever) but slow. If you want to use PSOCK just do it (I never do - you might as well just use a full cluster instead); multicore is for the cases where you want to parallelize something quickly and it works really well for that purpose.

I'd like to separate the issues here - the fact that RStudio has issues is really not R's fault - there is no technical reason why it shouldn't be able to handle it correctly. That is not to say that there are no cases where fork() is dangerous, but in most cases it's not and the benefits outweigh the risk.

That said, I do acknowledge the idea of having an ability to prevent forking if desired - I think that's a good idea, in particular if there is a standard that packages can also adhere to (yes, there are also packages that use fork() explicitly). I just think that the motivation is wrong (i.e., I don't think it would be wise for RStudio to prevent parallelization by default).

Also I'd like to point out that the main problem came about when packages started using parallel implicitly - the good citizens out there expose it as a parameter to the user, but not all packages do it which means you can hit forked code without knowing it. If you use mclapply() in user code, you typically know what you're doing, but if a package author does it for you, it's a different story.

Cheers,
Simon


> On Apr 12, 2019, at 21:50, Kevin Ushey <[hidden email]> wrote:
>
> I think it's worth saying that mclapply() works as documented: it
> relies on forking, and so doesn't work well in environments where it's
> unsafe to fork. This is spelled out explicitly in the documentation of
> ?mclapply:
>
> It is strongly discouraged to use these functions in GUI or embedded
> environments, because it leads to several processes sharing the same
> GUI which will likely cause chaos (and possibly crashes). Child
> processes should never use on-screen graphics devices.
>
> I believe the expectation is that users who need more control over the
> kind of cluster that's used for parallel computations would instead
> create the cluster themselves with e.g. `makeCluster()` and then use
> `clusterApply()` / `parLapply()` or other APIs as appropriate.
>
> In environments where forking works, `mclapply()` is nice because you
> don't need to think -- the process is forked, and anything available
> in your main session is automatically available in the child
> processes. This is a nice convenience for when you know it's safe to
> fork R (and know what you're doing is safe to do within a forked
> process). When it's not safe, it's better to prefer the other APIs
> available for computation on a cluster.
>
> Forking can be unsafe and dangerous, but it's also convenient and
> sometimes that convenience can outweigh the other concerns.
>
> Finally, I want to add: the onus should be on the front-end to work
> well with R, and not the other way around. I don't think it's fair to
> impose extra work / an extra maintenance burden on the R Core team for
> something that's already clearly documented ...
>
> Best,
> Kevin
>
>
> On Fri, Apr 12, 2019 at 6:04 PM Travers Ching <[hidden email]> wrote:
>>
>> Hi Inaki,
>>
>>> "Performant"... in terms of what. If the cost of copying the data
>>> predominates over the computation time, maybe you didn't need
>>> parallelization in the first place.
>>
>> Performant in terms of speed.  There's no copying in that example
>> using `mclapply` and so it is significantly faster than other
>> alternatives.
>>
>> It is a very simple and contrived example, but there are lots of
>> applications that depend on processing of large data and benefit from
>> multithreading.  For example, if I read in large sequencing data with
>> `Rsamtools` and want to check sequences for a set of motifs.
>>
>>> I don't see why mclapply could not be rewritten using PSOCK clusters.
>>
>> Because it would be much slower.
>>
>>> To implement copy-on-write, Linux overcommits virtual memory, and this
>>> is what causes scripts to break unexpectedly: everything works fine,
>>> until you change a small unimportant bit and... boom, out of memory.
>>> And in general, running forks in any GUI would cause things everywhere
>>> to break.
>>
>>> I'm not sure how did you setup that, but it does complete. Or do you
>>> mean that you ran out of memory? Then try replacing "x" with, e.g.,
>>> "x+1" in your mclapply example and see what happens (hint: save your
>>> work first).
>>
>> Yes, I meant that it ran out of memory on my desktop.  I understand
>> the limits, and it is not perfect because of the GUI issue you
>> mention, but I don't see a better alternative in terms of speed.
>>
>> Regards,
>> Travers
>>
>>
>>
>>
>> On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar <[hidden email]> wrote:
>>>
>>> On Fri, 12 Apr 2019 at 21:32, Travers Ching <[hidden email]> wrote:
>>>>
>>>> Just throwing my two cents in:
>>>>
>>>> I think removing/deprecating fork would be a bad idea for two reasons:
>>>>
>>>> 1) There are no performant alternatives
>>>
>>> "Performant"... in terms of what. If the cost of copying the data
>>> predominates over the computation time, maybe you didn't need
>>> parallelization in the first place.
>>>
>>>> 2) Removing fork would break existing workflows
>>>
>>> I don't see why mclapply could not be rewritten using PSOCK clusters.
>>> And as a side effect, this would enable those workflows on Windows,
>>> which doesn't support fork.
>>>
>>>> Even if replaced with something using the same interface (e.g., a
>>>> function that automatically detects variables to export as in the
>>>> amazing `future` package), the lack of copy-on-write functionality
>>>> would cause scripts everywhere to break.
>>>
>>> To implement copy-on-write, Linux overcommits virtual memory, and this
>>> is what causes scripts to break unexpectedly: everything works fine,
>>> until you change a small unimportant bit and... boom, out of memory.
>>> And in general, running forks in any GUI would cause things everywhere
>>> to break.
>>>
>>>> A simple example illustrating these two points:
>>>> `x <- 5e8; mclapply(1:24, sum, x, 8)`
>>>>
>>>> Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
>>>> does not complete.
>>>
>>> I'm not sure how did you setup that, but it does complete. Or do you
>>> mean that you ran out of memory? Then try replacing "x" with, e.g.,
>>> "x+1" in your mclapply example and see what happens (hint: save your
>>> work first).
>>>
>>> --
>>> Iñaki Úcar
>>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>
> I think it's worth saying that mclapply() works as documented

Mostly, yes. But it says nothing about fork's copy-on-write and memory
overcommitment, and that this means that it may work nicely or fail
spectacularly depending on whether, e.g., you operate on a long
vector.

--
Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK, because you would *have to* allocate n times as much memory, so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process.

Cheers,
Simon



> On Apr 13, 2019, at 06:05, Iñaki Ucar <[hidden email]> wrote:
>
> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>>
>> I think it's worth saying that mclapply() works as documented
>
> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> overcommitment, and that this means that it may work nicely or fail
> spectacularly depending on whether, e.g., you operate on a long
> vector.
>
> --
> Iñaki Úcar
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Sat, 13 Apr 2019 at 18:41, Simon Urbanek <[hidden email]> wrote:
>
> Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK, because you would *have to* allocate n times as much memory, so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process.

The point was that PSOCK fails and succeeds *consistently*,
independently of what you do with the input in the function provided.
I think that's a good property.

--
Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek


> On Apr 13, 2019, at 16:56, Iñaki Ucar <[hidden email]> wrote:
>
> On Sat, 13 Apr 2019 at 18:41, Simon Urbanek <[hidden email]> wrote:
>>
>> Sure, but that's a completely bogus argument because in that case it would fail even more spectacularly with any other method like PSOCK, because you would *have to* allocate n times as much memory, so unlike mclapply it is guaranteed to fail. With mclapply it is simply much more efficient as it will share memory as long as possible. It is rather obvious that any new objects you create can no longer be shared as they now exist separately in each process.
>
> The point was that PSOCK fails and succeeds *consistently*,
> independently of what you do with the input in the function provided.
> I think that's a good property.
>

So does parallel. It is consistent. If you do things that use too much memory you will consistently fail. That's a pretty universal rule, there is nothing probabilistic about it. It makes no difference if it's PSOCK, multicore, or anything else.


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Tomas Kalibera
On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>> I think it's worth saying that mclapply() works as documented
> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> overcommitment, and that this means that it may work nicely or fail
> spectacularly depending on whether, e.g., you operate on a long
> vector.

R cannot possibly replicate documentation of the underlying operating
systems. It clearly says that fork() is used and readers who may not
know what fork() is need to learn it from external sources.
Copy-on-write is an elementary property of fork().

Reimplementing mclapply to use PSOCK does not make sense -- if someone
wants to write code that can be used both with PSOCK and FORK, there is
the cluster API in parallel for that.

Tomas


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Iñaki Ucar
On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:

>
> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> > On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >> I think it's worth saying that mclapply() works as documented
> > Mostly, yes. But it says nothing about fork's copy-on-write and memory
> > overcommitment, and that this means that it may work nicely or fail
> > spectacularly depending on whether, e.g., you operate on a long
> > vector.
>
> R cannot possibly replicate documentation of the underlying operating
> systems. It clearly says that fork() is used and readers who may not
> know what fork() is need to learn it from external sources.
> Copy-on-write is an elementary property of fork().

Just to be precise, copy-on-write is an optimization widely deployed
in most modern *nixes, particularly for the architectures in which R
usually runs. But it is not an elementary property; it is not even
possible without an MMU.

--
Iñaki Úcar


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Tomas Kalibera
On 4/15/19 11:02 AM, Iñaki Ucar wrote:

> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>>>> I think it's worth saying that mclapply() works as documented
>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
>>> overcommitment, and that this means that it may work nicely or fail
>>> spectacularly depending on whether, e.g., you operate on a long
>>> vector.
>> R cannot possibly replicate documentation of the underlying operating
>> systems. It clearly says that fork() is used and readers who may not
>> know what fork() is need to learn it from external sources.
>> Copy-on-write is an elementary property of fork().
> Just to be precise, copy-on-write is an optimization widely deployed
> in most modern *nixes, particularly for the architectures in which R
> usually runs. But it is not an elementary property; it is not even
> possible without an MMU.

Yes, old Unix systems without virtual memory had fork eagerly copying.
Not relevant today, and certainly not for systems that run R, but indeed
people interested in OS internals can look elsewhere for more precise
information.

Tomas


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Henrik Bengtsson
I'd like to pick up this thread started on 2019-04-11
(https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
Modulo all the other suggestions in this thread, would my proposal of
being able to disable forked processing via an option or an
environment variable make sense?  I've prototyped a working patch that
works like:

> options(fork.allowed = FALSE)
> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
[1] 14058 14058
> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
[1] 14058 14058
> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
[1] 14058.1 14058.2
> f <- parallel::mcparallel(Sys.getpid())
Error in allowFork(assert = TRUE) :
  Forked processing is not allowed per option ‘fork.allowed’ or
environment variable ‘R_FORK_ALLOWED’
> cl <- parallel::makeForkCluster(1L)
Error in allowFork(assert = TRUE) :
  Forked processing is not allowed per option ‘fork.allowed’ or
environment variable ‘R_FORK_ALLOWED’
>


The patch is:

Index: src/library/parallel/R/unix/forkCluster.R
===================================================================
--- src/library/parallel/R/unix/forkCluster.R (revision 77648)
+++ src/library/parallel/R/unix/forkCluster.R (working copy)
@@ -30,6 +30,7 @@

 newForkNode <- function(..., options = defaultClusterOptions, rank)
 {
+    allowFork(assert = TRUE)
     options <- addClusterOptions(options, list(...))
     outfile <- getClusterOption("outfile", options)
     port <- getClusterOption("port", options)
Index: src/library/parallel/R/unix/mclapply.R
===================================================================
--- src/library/parallel/R/unix/mclapply.R (revision 77648)
+++ src/library/parallel/R/unix/mclapply.R (working copy)
@@ -28,7 +28,7 @@
         stop("'mc.cores' must be >= 1")
     .check_ncores(cores)

-    if (isChild() && !isTRUE(mc.allow.recursive))
+    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
         return(lapply(X = X, FUN = FUN, ...))

     ## Follow lapply
Index: src/library/parallel/R/unix/mcparallel.R
===================================================================
--- src/library/parallel/R/unix/mcparallel.R (revision 77648)
+++ src/library/parallel/R/unix/mcparallel.R (working copy)
@@ -20,6 +20,7 @@

 mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
 {
+    allowFork(assert = TRUE)
     f <- mcfork(detached)
     env <- parent.frame()
     if (isTRUE(mc.set.seed)) mc.advance.stream()
Index: src/library/parallel/R/unix/pvec.R
===================================================================
--- src/library/parallel/R/unix/pvec.R (revision 77648)
+++ src/library/parallel/R/unix/pvec.R (working copy)
@@ -25,7 +25,7 @@

     cores <- as.integer(mc.cores)
     if(cores < 1L) stop("'mc.cores' must be >= 1")
-    if(cores == 1L) return(FUN(v, ...))
+    if(cores == 1L || !allowFork()) return(FUN(v, ...))
     .check_ncores(cores)

     if(mc.set.seed) mc.reset.stream()

with a new file src/library/parallel/R/unix/allowFork.R:

allowFork <- function(assert = FALSE) {
    value <- Sys.getenv("R_FORK_ALLOWED")
    if (nzchar(value)) {
        value <- switch(value,
            "1" = , "TRUE"  = , "true"  = , "True"  = , "yes" = , "Yes" = TRUE,
            "0" = , "FALSE" = , "false" = , "False" = , "no"  = , "No"  = FALSE,
            stop(gettextf("invalid environment variable value: %s==%s",
                          "R_FORK_ALLOWED", value)))
        value <- as.logical(value)
    } else {
        value <- TRUE
    }
    value <- getOption("fork.allowed", value)
    if (is.na(value)) {
        stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
    }
    if (assert && !value) {
        stop(gettextf("Forked processing is not allowed per option %s or environment variable %s",
                      sQuote("fork.allowed"), sQuote("R_FORK_ALLOWED")))
    }
    value
}
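For illustration, the intended precedence (the `fork.allowed` option
overriding the `R_FORK_ALLOWED` environment variable) can be exercised
with a condensed, self-contained copy of the helper; the names follow
the proposal above and are not part of any released R:

```r
## Condensed copy of the proposed allowFork() helper, for illustration only
allowFork <- function(assert = FALSE) {
    value <- Sys.getenv("R_FORK_ALLOWED")
    value <- if (nzchar(value)) {
        switch(value,
            "1" = , "TRUE"  = , "true"  = , "yes" = TRUE,
            "0" = , "FALSE" = , "false" = , "no"  = FALSE,
            stop("invalid R_FORK_ALLOWED value: ", value))
    } else TRUE
    ## The R option, if set, overrides the environment variable
    value <- getOption("fork.allowed", value)
    if (assert && !value)
        stop("Forked processing is not allowed")
    value
}

Sys.setenv(R_FORK_ALLOWED = "no")
allowFork()                    ## FALSE: disabled via the environment variable
options(fork.allowed = TRUE)
allowFork()                    ## TRUE: the R option takes precedence
```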

/Henrik

On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:

>
> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
> > On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
> >> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> >>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >>>> I think it's worth saying that mclapply() works as documented
> >>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> >>> overcommitment, and that this means that it may work nicely or fail
> >>> spectacularly depending on whether, e.g., you operate on a long
> >>> vector.
> >> R cannot possibly replicate documentation of the underlying operating
> >> systems. It clearly says that fork() is used and readers who may not
> >> know what fork() is need to learn it from external sources.
> >> Copy-on-write is an elementary property of fork().
> > Just to be precise, copy-on-write is an optimization widely deployed
> > in most modern *nixes, particularly for the architectures in which R
> > usually runs. But it is not an elementary property; it is not even
> > possible without an MMU.
>
> Yes, old Unix systems without virtual memory had fork eagerly copying.
> Not relevant today, and certainly not for systems that run R, but indeed
> people interested in OS internals can look elsewhere for more precise
> information.
>
> Tomas
>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Tomas Kalibera
On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
> I'd like to pick up this thread started on 2019-04-11
> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
> Modulo all the other suggestions in this thread, would my proposal of
> being able to disable forked processing via an option or an
> environment variable make sense?

I don't think R should be doing that. There are caveats with using fork,
and they are mentioned in the documentation of the parallel package, so
people can easily avoid functions that use it, and this all has been
discussed here recently.

If needed, we can expand the documentation in the parallel package and
add a warning against the use of forking with RStudio, but for that it
would be good to know at least why it is not working. From the GitHub
issue I have the impression that it is not really known why, whether
it could be fixed, and if so, where. The same issue also reflects that
some people want to use forking for performance reasons, even with
RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is
just a race condition somewhere?

Tomas

>   I've prototyped a working patch that
> works like:
>
>> options(fork.allowed = FALSE)
>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
> [1] 14058 14058
>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
> [1] 14058 14058
>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
> [1] 14058.1 14058.2
>> f <- parallel::mcparallel(Sys.getpid())
> Error in allowFork(assert = TRUE) :
>    Forked processing is not allowed per option ‘fork.allowed’ or
> environment variable ‘R_FORK_ALLOWED’
>> cl <- parallel::makeForkCluster(1L)
> Error in allowFork(assert = TRUE) :
>    Forked processing is not allowed per option ‘fork.allowed’ or
> environment variable ‘R_FORK_ALLOWED’
>
> The patch is:
>
> Index: src/library/parallel/R/unix/forkCluster.R
> ===================================================================
> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
> @@ -30,6 +30,7 @@
>
>   newForkNode <- function(..., options = defaultClusterOptions, rank)
>   {
> +    allowFork(assert = TRUE)
>       options <- addClusterOptions(options, list(...))
>       outfile <- getClusterOption("outfile", options)
>       port <- getClusterOption("port", options)
> Index: src/library/parallel/R/unix/mclapply.R
> ===================================================================
> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
> +++ src/library/parallel/R/unix/mclapply.R (working copy)
> @@ -28,7 +28,7 @@
>           stop("'mc.cores' must be >= 1")
>       .check_ncores(cores)
>
> -    if (isChild() && !isTRUE(mc.allow.recursive))
> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
>           return(lapply(X = X, FUN = FUN, ...))
>
>       ## Follow lapply
> Index: src/library/parallel/R/unix/mcparallel.R
> ===================================================================
> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
> @@ -20,6 +20,7 @@
>
>   mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
>   {
> +    allowFork(assert = TRUE)
>       f <- mcfork(detached)
>       env <- parent.frame()
>       if (isTRUE(mc.set.seed)) mc.advance.stream()
> Index: src/library/parallel/R/unix/pvec.R
> ===================================================================
> --- src/library/parallel/R/unix/pvec.R (revision 77648)
> +++ src/library/parallel/R/unix/pvec.R (working copy)
> @@ -25,7 +25,7 @@
>
>       cores <- as.integer(mc.cores)
>       if(cores < 1L) stop("'mc.cores' must be >= 1")
> -    if(cores == 1L) return(FUN(v, ...))
> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
>       .check_ncores(cores)
>
>       if(mc.set.seed) mc.reset.stream()
>
> with a new file src/library/parallel/R/unix/allowFork.R:
>
> allowFork <- function(assert = FALSE) {
>      value <- Sys.getenv("R_FORK_ALLOWED")
>      if (nzchar(value)) {
>          value <- switch(value,
>             "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
>             "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
>              stop(gettextf("invalid environment variable value: %s==%s",
>             "R_FORK_ALLOWED", value)))
> value <- as.logical(value)
>      } else {
>          value <- TRUE
>      }
>      value <- getOption("fork.allowed", value)
>      if (is.na(value)) {
>          stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
>      }
>      if (assert && !value) {
>        stop(gettextf("Forked processing is not allowed per option %s or
> environment variable %s", sQuote("fork.allowed"),
> sQuote("R_FORK_ALLOWED")))
>      }
>      value
> }
>
> /Henrik
>
> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>>>>>> I think it's worth saying that mclapply() works as documented
>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
>>>>> overcommitment, and that this means that it may work nicely or fail
>>>>> spectacularly depending on whether, e.g., you operate on a long
>>>>> vector.
>>>> R cannot possibly replicate documentation of the underlying operating
>>>> systems. It clearly says that fork() is used and readers who may not
>>>> know what fork() is need to learn it from external sources.
>>>> Copy-on-write is an elementary property of fork().
>>> Just to be precise, copy-on-write is an optimization widely deployed
>>> in most modern *nixes, particularly for the architectures in which R
>>> usually runs. But it is not an elementary property; it is not even
>>> possible without an MMU.
>> Yes, old Unix systems without virtual memory had fork eagerly copying.
>> Not relevant today, and certainly not for systems that run R, but indeed
>> people interested in OS internals can look elsewhere for more precise
>> information.
>>
>> Tomas
>>


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
If I understand the thread correctly, this is an RStudio issue, and I would suggest that the developers consider using pthread_atfork() so that RStudio can handle forking as they deem fit (bail out with an error, or make RStudio work).  Note that, in principle, the functionality requested here can easily be implemented in a package, so R doesn’t need to be modified.
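As a sketch of that package-space route (all names here are
illustrative: safe_mclapply is hypothetical, and fork.allowed is the
option proposed earlier in this thread, not an existing R option), a
guarded drop-in wrapper could fall back to lapply() whenever forks are
disallowed:

```r
## Hypothetical user-space wrapper; safe_mclapply and the fork.allowed
## option are illustrative names, not part of R or of any package.
safe_mclapply <- function(X, FUN, ..., mc.cores = getOption("mc.cores", 2L)) {
    ## Run sequentially when forks are disabled (and on Windows, where
    ## mclapply() is sequential anyway)
    if (!isTRUE(getOption("fork.allowed", TRUE)) ||
        .Platform$OS.type == "windows")
        return(lapply(X, FUN, ...))
    parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
}

options(fork.allowed = FALSE)
pids <- unlist(safe_mclapply(1:4, function(x) Sys.getpid()))
length(unique(pids))           ## 1: everything ran in the current process
```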

Cheers,
Simon

Sent from my iPhone

>> On Jan 10, 2020, at 04:34, Tomas Kalibera <[hidden email]> wrote:
>>
>> On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
>> I'd like to pick up this thread started on 2019-04-11
>> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
>> Modulo all the other suggestions in this thread, would my proposal of
>> being able to disable forked processing via an option or an
>> environment variable make sense?
>
> I don't think R should be doing that. There are caveats with using fork, and they are mentioned in the documentation of the parallel package, so people can easily avoid functions that use it, and this all has been discussed here recently.
>
> If it is the case, we can expand the documentation in parallel package, add a warning against the use of forking with RStudio, but for that I it would be good to know at least why it is not working. From the github issue I have the impression that it is not really known why, whether it could be fixed, and if so, where. The same github issue reflects also that some people want to use forking for performance reasons, and even with RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is just some race condition somewhere?
>
> Tomas
>
>> I've prototyped a working patch that
>> works like:
>>> options(fork.allowed = FALSE)
>>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
>> [1] 14058 14058
>>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
>> [1] 14058 14058
>>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
>> [1] 14058.1 14058.2
>>> f <- parallel::mcparallel(Sys.getpid())
>> Error in allowFork(assert = TRUE) :
>>  Forked processing is not allowed per option ‘fork.allowed’ or
>> environment variable ‘R_FORK_ALLOWED’
>>> cl <- parallel::makeForkCluster(1L)
>> Error in allowFork(assert = TRUE) :
>>  Forked processing is not allowed per option ‘fork.allowed’ or
>> environment variable ‘R_FORK_ALLOWED’
>> The patch is:
>> Index: src/library/parallel/R/unix/forkCluster.R
>> ===================================================================
>> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
>> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
>> @@ -30,6 +30,7 @@
>> newForkNode <- function(..., options = defaultClusterOptions, rank)
>> {
>> +    allowFork(assert = TRUE)
>>     options <- addClusterOptions(options, list(...))
>>     outfile <- getClusterOption("outfile", options)
>>     port <- getClusterOption("port", options)
>> Index: src/library/parallel/R/unix/mclapply.R
>> ===================================================================
>> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
>> +++ src/library/parallel/R/unix/mclapply.R (working copy)
>> @@ -28,7 +28,7 @@
>>         stop("'mc.cores' must be >= 1")
>>     .check_ncores(cores)
>> -    if (isChild() && !isTRUE(mc.allow.recursive))
>> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
>>         return(lapply(X = X, FUN = FUN, ...))
>>     ## Follow lapply
>> Index: src/library/parallel/R/unix/mcparallel.R
>> ===================================================================
>> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
>> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
>> @@ -20,6 +20,7 @@
>> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
>> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
>> {
>> +    allowFork(assert = TRUE)
>>     f <- mcfork(detached)
>>     env <- parent.frame()
>>     if (isTRUE(mc.set.seed)) mc.advance.stream()
>> Index: src/library/parallel/R/unix/pvec.R
>> ===================================================================
>> --- src/library/parallel/R/unix/pvec.R (revision 77648)
>> +++ src/library/parallel/R/unix/pvec.R (working copy)
>> @@ -25,7 +25,7 @@
>>     cores <- as.integer(mc.cores)
>>     if(cores < 1L) stop("'mc.cores' must be >= 1")
>> -    if(cores == 1L) return(FUN(v, ...))
>> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
>>     .check_ncores(cores)
>>     if(mc.set.seed) mc.reset.stream()
>> with a new file src/library/parallel/R/unix/allowFork.R:
>> allowFork <- function(assert = FALSE) {
>>    value <- Sys.getenv("R_FORK_ALLOWED")
>>    if (nzchar(value)) {
>>        value <- switch(value,
>>           "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
>>           "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
>>            stop(gettextf("invalid environment variable value: %s==%s",
>>           "R_FORK_ALLOWED", value)))
>> value <- as.logical(value)
>>    } else {
>>        value <- TRUE
>>    }
>>    value <- getOption("fork.allowed", value)
>>    if (is.na(value)) {
>>        stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
>>    }
>>    if (assert && !value) {
>>      stop(gettextf("Forked processing is not allowed per option %s or
>> environment variable %s", sQuote("fork.allowed"),
>> sQuote("R_FORK_ALLOWED")))
>>    }
>>    value
>> }
>> /Henrik
>>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
>>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
>>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
>>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
>>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>>>>>>> I think it's worth saying that mclapply() works as documented
>>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
>>>>>> overcommitment, and that this means that it may work nicely or fail
>>>>>> spectacularly depending on whether, e.g., you operate on a long
>>>>>> vector.
>>>>> R cannot possibly replicate documentation of the underlying operating
>>>>> systems. It clearly says that fork() is used and readers who may not
>>>>> know what fork() is need to learn it from external sources.
>>>>> Copy-on-write is an elementary property of fork().
>>>> Just to be precise, copy-on-write is an optimization widely deployed
>>>> in most modern *nixes, particularly for the architectures in which R
>>>> usually runs. But it is not an elementary property; it is not even
>>>> possible without an MMU.
>>> Yes, old Unix systems without virtual memory had fork eagerly copying.
>>> Not relevant today, and certainly not for systems that run R, but indeed
>>> people interested in OS internals can look elsewhere for more precise
>>> information.
>>> Tomas
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Henrik Bengtsson-5
The RStudio GUI was just one example.  AFAIK, and please correct me if
I'm wrong, another example is where multi-threaded code is used in
forked processing, which is sometimes unstable.  Yet another, which
may or may not be multi-thread related, is
https://stat.ethz.ch/pipermail/r-devel/2018-September/076845.html:

res <- parallel::mclapply(urls, function(url) {
  download.file(url, basename(url))
})

That was reported to fail on macOS with the default method="libcurl"
but not for method="curl" or method="wget".

Further documentation is needed and would help, but I don't believe
it's sufficient to solve everyday problems.  The argument for
introducing an option/env var to disable forking is to give the end
user a quick workaround for newly introduced bugs.  Neither the
developer nor the end user has full control of the R package stack,
which is always in flux.  For instance, the above mclapply() code
might have been in a package on CRAN when all of a sudden
method="libcurl" became the new default in base R.  The above
mclapply() code is now buggy on macOS, and not necessarily caught by
CRAN checks.  The package developer might not notice this because they
are on Linux or Windows.  It can take a very long time before this
problem is even noticed, and even longer before it is tracked down and
fixed.  Similarly, as more and more code turns to native
implementations and it becomes easier and easier to use
multi-threading, more and more of these bugs across package
dependencies risk sneaking in through the backdoor wherever forked
processing is in place.

For the end user, but also for upstream package developers, the
quickest workaround would be to disable forking.  If you're
conservative, you could even disable it for all of your R processing.
Being able to quickly disable forking also provides a mechanism for
testing the hypothesis that forking is the underlying problem, i.e.
"Please retry with options(fork.allowed = FALSE)" will come in handy
for troubleshooting.
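In the meantime, the closest approximation available today relies on
mclapply()'s documented sequential fallback: with a single worker it
calls lapply() directly and never forks, although this only helps for
code that honours the mc.cores option rather than hard-coding the
mc.cores argument:

```r
## Force mclapply() into its sequential lapply() fallback; this only
## works for callers that respect getOption("mc.cores").
options(mc.cores = 1L)
pids <- unlist(parallel::mclapply(1:4, function(x) Sys.getpid()))
length(unique(pids))           ## 1: no child processes were forked
```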

/Henrik

On Fri, Jan 10, 2020 at 5:31 AM Simon Urbanek
<[hidden email]> wrote:

>
> If I understand the thread correctly this is an RStudio issue and I would suggest that the developers consider using pthread_atfork() so RStudio can handle forking as they deem fit (bail out with an error or make RStudio work).  Note that in principle the functionality requested here can be easily implemented in a package so R doesn’t need to be modified.
>
> Cheers,
> Simon
>
> Sent from my iPhone
>
> >> On Jan 10, 2020, at 04:34, Tomas Kalibera <[hidden email]> wrote:
> >>
> >> On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
> >> I'd like to pick up this thread started on 2019-04-11
> >> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
> >> Modulo all the other suggestions in this thread, would my proposal of
> >> being able to disable forked processing via an option or an
> >> environment variable make sense?
> >
> > I don't think R should be doing that. There are caveats with using fork, and they are mentioned in the documentation of the parallel package, so people can easily avoid functions that use it, and this all has been discussed here recently.
> >
> > If it is the case, we can expand the documentation in parallel package, add a warning against the use of forking with RStudio, but for that I it would be good to know at least why it is not working. From the github issue I have the impression that it is not really known why, whether it could be fixed, and if so, where. The same github issue reflects also that some people want to use forking for performance reasons, and even with RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is just some race condition somewhere?
> >
> > Tomas
> >
> >> I've prototyped a working patch that
> >> works like:
> >>> options(fork.allowed = FALSE)
> >>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
> >> [1] 14058 14058
> >>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
> >> [1] 14058 14058
> >>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
> >> [1] 14058.1 14058.2
> >>> f <- parallel::mcparallel(Sys.getpid())
> >> Error in allowFork(assert = TRUE) :
> >>  Forked processing is not allowed per option ‘fork.allowed’ or
> >> environment variable ‘R_FORK_ALLOWED’
> >>> cl <- parallel::makeForkCluster(1L)
> >> Error in allowFork(assert = TRUE) :
> >>  Forked processing is not allowed per option ‘fork.allowed’ or
> >> environment variable ‘R_FORK_ALLOWED’
> >> The patch is:
> >> Index: src/library/parallel/R/unix/forkCluster.R
> >> ===================================================================
> >> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
> >> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
> >> @@ -30,6 +30,7 @@
> >> newForkNode <- function(..., options = defaultClusterOptions, rank)
> >> {
> >> +    allowFork(assert = TRUE)
> >>     options <- addClusterOptions(options, list(...))
> >>     outfile <- getClusterOption("outfile", options)
> >>     port <- getClusterOption("port", options)
> >> Index: src/library/parallel/R/unix/mclapply.R
> >> ===================================================================
> >> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
> >> +++ src/library/parallel/R/unix/mclapply.R (working copy)
> >> @@ -28,7 +28,7 @@
> >>         stop("'mc.cores' must be >= 1")
> >>     .check_ncores(cores)
> >> -    if (isChild() && !isTRUE(mc.allow.recursive))
> >> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
> >>         return(lapply(X = X, FUN = FUN, ...))
> >>     ## Follow lapply
> >> Index: src/library/parallel/R/unix/mcparallel.R
> >> ===================================================================
> >> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
> >> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
> >> @@ -20,6 +20,7 @@
> >> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
> >> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
> >> {
> >> +    allowFork(assert = TRUE)
> >>     f <- mcfork(detached)
> >>     env <- parent.frame()
> >>     if (isTRUE(mc.set.seed)) mc.advance.stream()
> >> Index: src/library/parallel/R/unix/pvec.R
> >> ===================================================================
> >> --- src/library/parallel/R/unix/pvec.R (revision 77648)
> >> +++ src/library/parallel/R/unix/pvec.R (working copy)
> >> @@ -25,7 +25,7 @@
> >>     cores <- as.integer(mc.cores)
> >>     if(cores < 1L) stop("'mc.cores' must be >= 1")
> >> -    if(cores == 1L) return(FUN(v, ...))
> >> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
> >>     .check_ncores(cores)
> >>     if(mc.set.seed) mc.reset.stream()
> >> with a new file src/library/parallel/R/unix/allowFork.R:
> >> allowFork <- function(assert = FALSE) {
> >>    value <- Sys.getenv("R_FORK_ALLOWED")
> >>    if (nzchar(value)) {
> >>        value <- switch(value,
> >>           "1"=, "TRUE"=, "true"=, "True"=, "yes"=, "Yes"= TRUE,
> >>           "0"=, "FALSE"=,"false"=,"False"=, "no"=, "No" = FALSE,
> >>            stop(gettextf("invalid environment variable value: %s==%s",
> >>           "R_FORK_ALLOWED", value)))
> >> value <- as.logical(value)
> >>    } else {
> >>        value <- TRUE
> >>    }
> >>    value <- getOption("fork.allowed", value)
> >>    if (is.na(value)) {
> >>        stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
> >>    }
> >>    if (assert && !value) {
> >>      stop(gettextf("Forked processing is not allowed per option %s or
> >> environment variable %s", sQuote("fork.allowed"),
> >> sQuote("R_FORK_ALLOWED")))
> >>    }
> >>    value
> >> }
> >> /Henrik
> >>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
> >>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
> >>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
> >>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
> >>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
> >>>>>>> I think it's worth saying that mclapply() works as documented
> >>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
> >>>>>> overcommitment, and that this means that it may work nicely or fail
> >>>>>> spectacularly depending on whether, e.g., you operate on a long
> >>>>>> vector.
> >>>>> R cannot possibly replicate documentation of the underlying operating
> >>>>> systems. It clearly says that fork() is used and readers who may not
> >>>>> know what fork() is need to learn it from external sources.
> >>>>> Copy-on-write is an elementary property of fork().
> >>>> Just to be precise, copy-on-write is an optimization widely deployed
> >>>> in most modern *nixes, particularly for the architectures in which R
> >>>> usually runs. But it is not an elementary property; it is not even
> >>>> possible without an MMU.
> >>> Yes, old Unix systems without virtual memory had fork eagerly copying.
> >>> Not relevant today, and certainly not for systems that run R, but indeed
> >>> people interested in OS internals can look elsewhere for more precise
> >>> information.
> >>> Tomas
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Simon Urbanek
Henrik,

the example from the post works just fine for me in CRAN R - the post was about a Homebrew build, so it's conceivably a bug in their libraries. That's exactly why I was proposing a more general solution where you can simply define a function in user space that will issue a warning or stop on fork. It doesn't have to be part of core R; there are other packages that use fork() as well, so what I proposed is much safer than hacking the parallel package.

Cheers,
Simon
 


> On Jan 10, 2020, at 10:58 AM, Henrik Bengtsson <[hidden email]> wrote:
>
> The RStudio GUI was just one example.  AFAIK, and please correct me if
> I'm wrong, another example is where multi-threaded code is used in
> forked processing and that's sometimes unstable.  Yes another, which
> might be multi-thread related or not, is
> https://stat.ethz.ch/pipermail/r-devel/2018-September/076845.html:
>
> res <- parallel::mclapply(urls, function(url) {
>  download.file(url, basename(url))
> })
>
> That was reported to fail on macOS with the default method="libcurl"
> but not for method="curl" or method="wget".
>
> Further documentation is needed and would help but I don't believe
> it's sufficient to solve everyday problems.  The argument for
> introducing an option/env var to disable forking is to give the end
> user a quick workaround for newly introduced bugs.  Neither the
> develop nor the end user have full control of the R package stack,
> which is always in flux.  For instance, above mclapply() code might
> have been in a package on CRAN and then all of a sudden
> method="libcurl" became the new default in base R.  The above
> mclapply() code is now buggy on macOS, and not necessarily caught by
> CRAN checks.  The package developer might not notice this because they
> are on Linux or Windows.  It can take a very long time before this
> problem is even noticed and even further before it is tracked down and
> fixed.   Similarly, as more and more code turn to native code and it
> becomes easier and easier to implement multi-threading, more and more
> of these bugs across package dependencies risk sneaking in the
> backdoor wherever forked processing is in place.
>
> For the end user, but also higher-up upstream package developers, the
> quickest workaround would be disable forking.  If you're conservative,
> you could even disable it all of your R processing.  Being able to
> quickly disable forking will also provide a mechanism for quickly
> testing the hypothesis that forking is the underlying problem, i.e.
> "Please retry with options(fork.allowed = FALSE)" will become handy
> for troubleshooting.
>
> /Henrik
>
> On Fri, Jan 10, 2020 at 5:31 AM Simon Urbanek
> <[hidden email]> wrote:
>>
>> If I understand the thread correctly this is an RStudio issue and I would suggest that the developers consider using pthread_atfork() so RStudio can handle forking as they deem fit (bail out with an error or make RStudio work).  Note that in principle the functionality requested here can be easily implemented in a package so R doesn’t need to be modified.
>>
>> Cheers,
>> Simon
>>
>> Sent from my iPhone
>>
>>>> On Jan 10, 2020, at 04:34, Tomas Kalibera <[hidden email]> wrote:
>>>>
>>>> On 1/10/20 7:33 AM, Henrik Bengtsson wrote:
>>>> I'd like to pick up this thread started on 2019-04-11
>>>> (https://hypatia.math.ethz.ch/pipermail/r-devel/2019-April/077632.html).
>>>> Modulo all the other suggestions in this thread, would my proposal of
>>>> being able to disable forked processing via an option or an
>>>> environment variable make sense?
>>>
>>> I don't think R should be doing that. There are caveats with using fork, and they are mentioned in the documentation of the parallel package, so people can easily avoid functions that use it, and this all has been discussed here recently.
>>>
>>> If it is the case, we can expand the documentation in parallel package, add a warning against the use of forking with RStudio, but for that I it would be good to know at least why it is not working. From the github issue I have the impression that it is not really known why, whether it could be fixed, and if so, where. The same github issue reflects also that some people want to use forking for performance reasons, and even with RStudio, at least on Linux. Perhaps it could be fixed? Perhaps it is just some race condition somewhere?
>>>
>>> Tomas
>>>
>>>> I've prototyped a working patch that
>>>> works like:
>>>>> options(fork.allowed = FALSE)
>>>>> unlist(parallel::mclapply(1:2, FUN = function(x) Sys.getpid()))
>>>> [1] 14058 14058
>>>>> parallel::mcmapply(1:2, FUN = function(x) Sys.getpid())
>>>> [1] 14058 14058
>>>>> parallel::pvec(1:2, FUN = function(x) Sys.getpid() + x/10)
>>>> [1] 14058.1 14058.2
>>>>> f <- parallel::mcparallel(Sys.getpid())
>>>> Error in allowFork(assert = TRUE) :
>>>> Forked processing is not allowed per option ‘fork.allowed’ or
>>>> environment variable ‘R_FORK_ALLOWED’
>>>>> cl <- parallel::makeForkCluster(1L)
>>>> Error in allowFork(assert = TRUE) :
>>>> Forked processing is not allowed per option ‘fork.allowed’ or
>>>> environment variable ‘R_FORK_ALLOWED’
>>>> The patch is:
>>>> Index: src/library/parallel/R/unix/forkCluster.R
>>>> ===================================================================
>>>> --- src/library/parallel/R/unix/forkCluster.R (revision 77648)
>>>> +++ src/library/parallel/R/unix/forkCluster.R (working copy)
>>>> @@ -30,6 +30,7 @@
>>>> newForkNode <- function(..., options = defaultClusterOptions, rank)
>>>> {
>>>> +    allowFork(assert = TRUE)
>>>>    options <- addClusterOptions(options, list(...))
>>>>    outfile <- getClusterOption("outfile", options)
>>>>    port <- getClusterOption("port", options)
>>>> Index: src/library/parallel/R/unix/mclapply.R
>>>> ===================================================================
>>>> --- src/library/parallel/R/unix/mclapply.R (revision 77648)
>>>> +++ src/library/parallel/R/unix/mclapply.R (working copy)
>>>> @@ -28,7 +28,7 @@
>>>>        stop("'mc.cores' must be >= 1")
>>>>    .check_ncores(cores)
>>>> -    if (isChild() && !isTRUE(mc.allow.recursive))
>>>> +    if (!allowFork() || (isChild() && !isTRUE(mc.allow.recursive)))
>>>>        return(lapply(X = X, FUN = FUN, ...))
>>>>    ## Follow lapply
>>>> Index: src/library/parallel/R/unix/mcparallel.R
>>>> ===================================================================
>>>> --- src/library/parallel/R/unix/mcparallel.R (revision 77648)
>>>> +++ src/library/parallel/R/unix/mcparallel.R (working copy)
>>>> @@ -20,6 +20,7 @@
>>>> mcparallel <- function(expr, name, mc.set.seed = TRUE, silent =
>>>> FALSE, mc.affinity = NULL, mc.interactive = FALSE, detached = FALSE)
>>>> {
>>>> +    allowFork(assert = TRUE)
>>>>    f <- mcfork(detached)
>>>>    env <- parent.frame()
>>>>    if (isTRUE(mc.set.seed)) mc.advance.stream()
>>>> Index: src/library/parallel/R/unix/pvec.R
>>>> ===================================================================
>>>> --- src/library/parallel/R/unix/pvec.R (revision 77648)
>>>> +++ src/library/parallel/R/unix/pvec.R (working copy)
>>>> @@ -25,7 +25,7 @@
>>>>    cores <- as.integer(mc.cores)
>>>>    if(cores < 1L) stop("'mc.cores' must be >= 1")
>>>> -    if(cores == 1L) return(FUN(v, ...))
>>>> +    if(cores == 1L || !allowFork()) return(FUN(v, ...))
>>>>    .check_ncores(cores)
>>>>    if(mc.set.seed) mc.reset.stream()
>>>> with a new file src/library/parallel/R/unix/allowFork.R:
>>>> allowFork <- function(assert = FALSE) {
>>>>     value <- Sys.getenv("R_FORK_ALLOWED")
>>>>     if (nzchar(value)) {
>>>>         value <- switch(value,
>>>>             "1" = , "TRUE" = , "true" = , "True" = , "yes" = , "Yes" = TRUE,
>>>>             "0" = , "FALSE" = , "false" = , "False" = , "no" = , "No" = FALSE,
>>>>             stop(gettextf("invalid environment variable value: %s==%s",
>>>>                           "R_FORK_ALLOWED", value)))
>>>>         value <- as.logical(value)
>>>>     } else {
>>>>         value <- TRUE
>>>>     }
>>>>     value <- getOption("fork.allowed", value)
>>>>     if (is.na(value)) {
>>>>         stop(gettextf("invalid option value: %s==%s", "fork.allowed", value))
>>>>     }
>>>>     if (assert && !value) {
>>>>         stop(gettextf("Forked processing is not allowed per option %s or environment variable %s",
>>>>                       sQuote("fork.allowed"), sQuote("R_FORK_ALLOWED")))
>>>>     }
>>>>     value
>>>> }
>>>> /Henrik
>>>>> On Mon, Apr 15, 2019 at 3:12 AM Tomas Kalibera <[hidden email]> wrote:
>>>>> On 4/15/19 11:02 AM, Iñaki Ucar wrote:
>>>>>> On Mon, 15 Apr 2019 at 08:44, Tomas Kalibera <[hidden email]> wrote:
>>>>>>> On 4/13/19 12:05 PM, Iñaki Ucar wrote:
>>>>>>>> On Sat, 13 Apr 2019 at 03:51, Kevin Ushey <[hidden email]> wrote:
>>>>>>>>> I think it's worth saying that mclapply() works as documented
>>>>>>>> Mostly, yes. But it says nothing about fork's copy-on-write and memory
>>>>>>>> overcommitment, and that this means that it may work nicely or fail
>>>>>>>> spectacularly depending on whether, e.g., you operate on a long
>>>>>>>> vector.
>>>>>>> R cannot possibly replicate documentation of the underlying operating
>>>>>>> systems. It clearly says that fork() is used and readers who may not
>>>>>>> know what fork() is need to learn it from external sources.
>>>>>>> Copy-on-write is an elementary property of fork().
>>>>>> Just to be precise, copy-on-write is an optimization widely deployed
>>>>>> in most modern *nixes, particularly for the architectures in which R
>>>>>> usually runs. But it is not an elementary property; it is not even
>>>>>> possible without an MMU.
>>>>> Yes, old Unix systems without virtual memory had fork eagerly copying.
>>>>> Not relevant today, and certainly not for systems that run R, but indeed
>>>>> people interested in OS internals can look elsewhere for more precise
>>>>> information.
>>>>> Tomas
>>>
>

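The user-space fallback discussed in this thread can already be approximated without modifying R. A minimal sketch follows; note that `fork.allowed` here is the *proposed* option name from Henrik's patch, not an option base R actually consults, and `safe_mclapply` is a hypothetical wrapper, not part of parallel:

```r
## Wrapper honoring a (proposed, not built-in) 'fork.allowed' option:
## when forking is disabled, fall back to sequential lapply().
safe_mclapply <- function(X, FUN, ..., mc.cores = getOption("mc.cores", 2L)) {
  forks_ok <- isTRUE(getOption("fork.allowed", TRUE)) &&
              .Platform$OS.type != "windows"
  if (!forks_ok)
    return(lapply(X, FUN, ...))
  parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
}

options(fork.allowed = FALSE)
pids <- unlist(safe_mclapply(1:2, function(x) Sys.getpid()))
stopifnot(length(unique(pids)) == 1L)  # ran sequentially, in this process
```

This is essentially the behavior of the patch above, relocated to user space, which is also in the spirit of Simon's remark that the functionality could live in a package rather than in core R.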
Reply | Threaded
Open this post in threaded view
|

Re: SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

Gábor Csárdi
On Fri, Jan 10, 2020 at 7:23 PM Simon Urbanek
<[hidden email]> wrote:
>
> Henrik,
>
> the example from the post works just fine in CRAN R for me; the post was about a Homebrew build, so it's conceivably a bug in their libraries.

I think it works now, because Apple switched to a different SSL
library for libcurl. It usually crashes or fails on older macOS
versions, with the CRAN build of R as well.

It is not a bug in any library; it is just that macOS does not support
fork() without an immediate exec().

In general, any code that calls the macOS system libraries might
crash. (Except for CoreFoundation, which seems to be fine, but AFAIR
there is no guarantee for that, either.)

You get crashes in the terminal as well, without multithreading. E.g.
the keyring package links to the Security library on macOS, so you
get:

❯ R --vanilla -q
> .libPaths("~/R/3.6")
> keyring::key_list()[1:2,]
        service                                                  username
1    CommCenter                             kEntitlementsUniqueIDCacheKey
2           ids                                   identity-rsa-public-key
> parallel::mclapply(1:10, function(i) keyring::key_list()[1:2,])

 *** caught segfault ***
address 0x110, cause 'memory not mapped'

 *** caught segfault ***
address 0x110, cause 'memory not mapped'

AFAICT only Apple can do anything about this, and they won't.

Gabor
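A fork-free way to run this kind of code is a PSOCK cluster, whose workers are freshly launched R processes rather than forked children. A sketch (using Sys.getpid() as a stand-in for calls like keyring::key_list(), which may not be installed):

```r
library(parallel)

## PSOCK workers are new R processes started over sockets, not fork()ed
## children, so macOS system libraries can be called in them safely.
cl <- makeCluster(2L, type = "PSOCK")
res <- parLapply(cl, 1:10, function(i) Sys.getpid())
stopCluster(cl)

## The work really ran in (at most) two separate worker processes,
## none of which is the master:
stopifnot(length(unique(unlist(res))) <= 2L,
          !any(unlist(res) == Sys.getpid()))
```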

> That's exactly why I was proposing a more general solution where you can simply define a function in user-space that will issue a warning or stop on fork, it doesn't have to be part of core R, there are other packages that use fork() as well, so what I proposed is much safer than hacking the parallel package.
>
> Cheers,
> Simon
>
>
>
> > On Jan 10, 2020, at 10:58 AM, Henrik Bengtsson <[hidden email]> wrote:
> >
> > The RStudio GUI was just one example.  AFAIK, and please correct me if
> > I'm wrong, another example is where multi-threaded code is used in
> > forked processing and that's sometimes unstable.  Yet another, which
> > might be multi-thread related or not, is
> > https://stat.ethz.ch/pipermail/r-devel/2018-September/076845.html:
> >
> > res <- parallel::mclapply(urls, function(url) {
> >  download.file(url, basename(url))
> > })
> >
> > That was reported to fail on macOS with the default method="libcurl"
> > but not for method="curl" or method="wget".
> >
> > Further documentation is needed and would help, but I don't believe
> > it's sufficient to solve everyday problems.  The argument for
> > introducing an option/env var to disable forking is to give the end
> > user a quick workaround for newly introduced bugs.  Neither the
> > developer nor the end user has full control of the R package stack,
> > which is always in flux.  For instance, the above mclapply() code might
> > have been in a package on CRAN and then all of a sudden
> > method="libcurl" became the new default in base R.  The above
> > mclapply() code is now buggy on macOS, and not necessarily caught by
> > CRAN checks.  The package developer might not notice this because they
> > are on Linux or Windows.  It can take a very long time before this
> > problem is even noticed and even further before it is tracked down and
> > fixed.   Similarly, as more and more code turns to native code and it
> > becomes easier and easier to implement multi-threading, more and more
> > of these bugs across package dependencies risk sneaking in the
> > backdoor wherever forked processing is in place.
> >
> > For the end user, but also higher-up upstream package developers, the
> > quickest workaround would be to disable forking.  If you're conservative,
> > you could even disable it for all of your R processing.  Being able to
> > quickly disable forking will also provide a mechanism for quickly
> > testing the hypothesis that forking is the underlying problem, i.e.
> > "Please retry with options(fork.allowed = FALSE)" will become handy
> > for troubleshooting.
> >
> > /Henrik