parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson-5
BACKGROUND:

While troubleshooting random, occasionally occurring, errors from
parallel::makePSOCKcluster("localhost", port = 11000);

Error in socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE, :
    cannot open the connection

I had another look at parallel:::newPSOCKnode(), which is used
internally to set up each background worker.  It is designed to, first
launch the background worker as:

  system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
wait = FALSE)

which immediately tries to connect to a socket on localhost:11000 with
timeout.  Immediately after the master launched the above (without
waiting), it will set up the connection waiting for the connect from
the background worker:

    con <- socketConnection("localhost", port = 11000, server = TRUE,
        blocking = TRUE, open = "a+b", timeout = timeout)


ISSUE:

If we emulate the above process, and remove the OUT=/dev/null such
that we can see the output produces by the worker, as:

setup <- function(delay = 0) {
  system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
  Sys.sleep(delay)
  socketConnection("localhost", port = 11000, server = TRUE, blocking
= TRUE, open = "a+b", timeout = 20)
}

doing:

> con <- setup(0)
starting worker pid=24983 on localhost:11000 at 18:44:30.087

will most likely work, but adding a delay:

> con <- setup(5)
starting worker pid=25099 on localhost:11000 at 18:45:23.617
Warning in socketConnection(master, port = port, blocking = TRUE, open
= "a+b",  :
  localhost:11000 cannot be opened
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
  cannot open the connection
Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
socketConnection

will produce an *instant* error on the worker, and before master opens
the server socket.  Eventually, master will produce the originally
observed error:

Error in socketConnection("localhost", port = 11000, server = TRUE,
blocking = TRUE,  :
  cannot open the connection

In other words, if the master fails to setup socketConnection()
*before* the background workers attempts to connect, it all fails.
Such a delay may happen for instance when there is a large CPU load on
the test machine.

Is this the above bug?

/Henrik

PS. The background is that I, very occasionally, observe R CMD check
error on the above (on CRAN and elsewhere) when testing my future
package. The error always go away when retested. This far I've though
this is due to port clashes (since the port is random selected in
[11000:11999]) and accepted that it happens.  However, after
discovering the above, it could be due to the worker launching "too
soon".

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson-5
I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
socketConnection() as a character, i.e. there's a missing timeout <-
as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
as.logical(value):

> parallel:::.slaveRSOCK
function ()
{
    makeSOCKmaster <- function(master, port, timeout, useXDR) {
        port <- as.integer(port)
        con <- socketConnection(master, port = port, blocking = TRUE,
            open = "a+b", timeout = timeout)
        structure(list(con = con), class = if (useXDR)
            "SOCKnode"
        else "SOCK0node")
    }
    master <- "localhost"
    port <- NA_integer_
    outfile <- Sys.getenv("R_SNOW_OUTFILE")
    methods <- TRUE
    useXDR <- TRUE
    for (a in commandArgs(TRUE)) {
        pos <- regexpr("=", a)
        name <- substr(a, 1L, pos - 1L)
        value <- substr(a, pos + 1L, nchar(a))
        switch(name, MASTER = {
            master <- value
        }, PORT = {
            port <- value
        }, OUT = {
            outfile <- value
        }, TIMEOUT = {
            timeout <- value
        }, XDR = {
            useXDR <- as.logical(value)
        })
    }
    if (is.na(port))
        stop("PORT must be specified")
    sinkWorkerOutput(outfile)
    msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
        paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
    cat(msg)
    slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
}
<bytecode: 0x4bd4b58>
<environment: namespace:parallel>

Yet, fix that does *not* seem to change anything.

/Henrik

On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
<[hidden email]> wrote:

> BACKGROUND:
>
> While troubleshooting random, occasionally occurring, errors from
> parallel::makePSOCKcluster("localhost", port = 11000);
>
> Error in socketConnection("localhost", port = port, server = TRUE,
> blocking = TRUE, :
>     cannot open the connection
>
> I had another look at parallel:::newPSOCKnode(), which is used
> internally to set up each background worker.  It is designed to, first
> launch the background worker as:
>
>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
> wait = FALSE)
>
> which immediately tries to connect to a socket on localhost:11000 with
> timeout.  Immediately after the master launched the above (without
> waiting), it will set up the connection waiting for the connect from
> the background worker:
>
>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>         blocking = TRUE, open = "a+b", timeout = timeout)
>
>
> ISSUE:
>
> If we emulate the above process, and remove the OUT=/dev/null such
> that we can see the output produces by the worker, as:
>
> setup <- function(delay = 0) {
>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>   Sys.sleep(delay)
>   socketConnection("localhost", port = 11000, server = TRUE, blocking
> = TRUE, open = "a+b", timeout = 20)
> }
>
> doing:
>
>> con <- setup(0)
> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>
> will most likely work, but adding a delay:
>
>> con <- setup(5)
> starting worker pid=25099 on localhost:11000 at 18:45:23.617
> Warning in socketConnection(master, port = port, blocking = TRUE, open
> = "a+b",  :
>   localhost:11000 cannot be opened
> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
>   cannot open the connection
> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
> socketConnection
>
> will produce an *instant* error on the worker, and before master opens
> the server socket.  Eventually, master will produce the originally
> observed error:
>
> Error in socketConnection("localhost", port = 11000, server = TRUE,
> blocking = TRUE,  :
>   cannot open the connection
>
> In other words, if the master fails to setup socketConnection()
> *before* the background workers attempts to connect, it all fails.
> Such a delay may happen for instance when there is a large CPU load on
> the test machine.
>
> Is this the above bug?
>
> /Henrik
>
> PS. The background is that I, very occasionally, observe R CMD check
> error on the above (on CRAN and elsewhere) when testing my future
> package. The error always go away when retested. This far I've though
> this is due to port clashes (since the port is random selected in
> [11000:11999]) and accepted that it happens.  However, after
> discovering the above, it could be due to the worker launching "too
> soon".
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.4 LTS
>
> Matrix products: default
> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson-5
A solution is to have parallel:::.slaveRSOCK() attempt to connect
multiple times before failing, e.g.

    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
= 10L, interval = 1.0) {
        port <- as.integer(port)
        for (i in seq_len(maxTries)) {
          con <- tryCatch({
            socketConnection(master, port = port, blocking = TRUE,
                                    open = "a+b", timeout = timeout)
          }, error = identity)
          if (inherits(con, "connection")) break
          Sys.sleep(interval)
        }
        if (inherits(con, "error")) stop(con)
        structure(list(con = con), class = if (useXDR)
            "SOCKnode"
        else "SOCK0node")
    }

One could set 'maxTries' and 'interval' via commandArgs() like what is
done for the other arguments.

I'm happy to submit an SVN patch if R core thinks this is an
acceptable solution.

/Henrik

On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
<[hidden email]> wrote:

> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
> socketConnection() as a character, i.e. there's a missing timeout <-
> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
> as.logical(value):
>
>> parallel:::.slaveRSOCK
> function ()
> {
>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>         port <- as.integer(port)
>         con <- socketConnection(master, port = port, blocking = TRUE,
>             open = "a+b", timeout = timeout)
>         structure(list(con = con), class = if (useXDR)
>             "SOCKnode"
>         else "SOCK0node")
>     }
>     master <- "localhost"
>     port <- NA_integer_
>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>     methods <- TRUE
>     useXDR <- TRUE
>     for (a in commandArgs(TRUE)) {
>         pos <- regexpr("=", a)
>         name <- substr(a, 1L, pos - 1L)
>         value <- substr(a, pos + 1L, nchar(a))
>         switch(name, MASTER = {
>             master <- value
>         }, PORT = {
>             port <- value
>         }, OUT = {
>             outfile <- value
>         }, TIMEOUT = {
>             timeout <- value
>         }, XDR = {
>             useXDR <- as.logical(value)
>         })
>     }
>     if (is.na(port))
>         stop("PORT must be specified")
>     sinkWorkerOutput(outfile)
>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>     cat(msg)
>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
> }
> <bytecode: 0x4bd4b58>
> <environment: namespace:parallel>
>
> Yet, fix that does *not* seem to change anything.
>
> /Henrik
>
> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
> <[hidden email]> wrote:
>> BACKGROUND:
>>
>> While troubleshooting random, occasionally occurring, errors from
>> parallel::makePSOCKcluster("localhost", port = 11000);
>>
>> Error in socketConnection("localhost", port = port, server = TRUE,
>> blocking = TRUE, :
>>     cannot open the connection
>>
>> I had another look at parallel:::newPSOCKnode(), which is used
>> internally to set up each background worker.  It is designed to, first
>> launch the background worker as:
>>
>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>> wait = FALSE)
>>
>> which immediately tries to connect to a socket on localhost:11000 with
>> timeout.  Immediately after the master launched the above (without
>> waiting), it will set up the connection waiting for the connect from
>> the background worker:
>>
>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>
>>
>> ISSUE:
>>
>> If we emulate the above process, and remove the OUT=/dev/null such
>> that we can see the output produces by the worker, as:
>>
>> setup <- function(delay = 0) {
>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>   Sys.sleep(delay)
>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>> = TRUE, open = "a+b", timeout = 20)
>> }
>>
>> doing:
>>
>>> con <- setup(0)
>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>
>> will most likely work, but adding a delay:
>>
>>> con <- setup(5)
>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>> = "a+b",  :
>>   localhost:11000 cannot be opened
>> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
>>   cannot open the connection
>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>> socketConnection
>>
>> will produce an *instant* error on the worker, and before master opens
>> the server socket.  Eventually, master will produce the originally
>> observed error:
>>
>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>> blocking = TRUE,  :
>>   cannot open the connection
>>
>> In other words, if the master fails to setup socketConnection()
>> *before* the background workers attempts to connect, it all fails.
>> Such a delay may happen for instance when there is a large CPU load on
>> the test machine.
>>
>> Is this the above bug?
>>
>> /Henrik
>>
>> PS. The background is that I, very occasionally, observe R CMD check
>> error on the above (on CRAN and elsewhere) when testing my future
>> package. The error always go away when retested. This far I've though
>> this is due to port clashes (since the port is random selected in
>> [11000:11999]) and accepted that it happens.  However, after
>> discovering the above, it could be due to the worker launching "too
>> soon".
>>
>>> sessionInfo()
>> R version 3.4.3 (2017-11-30)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 16.04.4 LTS
>>
>> Matrix products: default
>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.3

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

luke-tierney
I'm happy to look at a patch that does this.  I'd start with a small
interval and increase it by 50%, say, on each try wit a max retry time
limit. This isn't eliminating the problem,only reducing the
probability, but still worth it. I had considered doing something like
this but it didn't seem necessary at the time. You don't want to retry
indefinitely since the connection could be failing because the master
died, and then you want the workers to die as well.

Best,

luke

On Fri, 9 Mar 2018, Henrik Bengtsson wrote:

> A solution is to have parallel:::.slaveRSOCK() attempt to connect
> multiple times before failing, e.g.
>
>    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
> = 10L, interval = 1.0) {
>        port <- as.integer(port)
>        for (i in seq_len(maxTries)) {
>          con <- tryCatch({
>            socketConnection(master, port = port, blocking = TRUE,
>                                    open = "a+b", timeout = timeout)
>          }, error = identity)
>          if (inherits(con, "connection")) break
>          Sys.sleep(interval)
>        }
>        if (inherits(con, "error")) stop(con)
>        structure(list(con = con), class = if (useXDR)
>            "SOCKnode"
>        else "SOCK0node")
>    }
>
> One could set 'maxTries' and 'interval' via commandArgs() like what is
> done for the other arguments.
>
> I'm happy to submit an SVN patch if R core thinks this is an
> acceptable solution.
>
> /Henrik
>
> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
> <[hidden email]> wrote:
>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>> socketConnection() as a character, i.e. there's a missing timeout <-
>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>> as.logical(value):
>>
>>> parallel:::.slaveRSOCK
>> function ()
>> {
>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>         port <- as.integer(port)
>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>             open = "a+b", timeout = timeout)
>>         structure(list(con = con), class = if (useXDR)
>>             "SOCKnode"
>>         else "SOCK0node")
>>     }
>>     master <- "localhost"
>>     port <- NA_integer_
>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>     methods <- TRUE
>>     useXDR <- TRUE
>>     for (a in commandArgs(TRUE)) {
>>         pos <- regexpr("=", a)
>>         name <- substr(a, 1L, pos - 1L)
>>         value <- substr(a, pos + 1L, nchar(a))
>>         switch(name, MASTER = {
>>             master <- value
>>         }, PORT = {
>>             port <- value
>>         }, OUT = {
>>             outfile <- value
>>         }, TIMEOUT = {
>>             timeout <- value
>>         }, XDR = {
>>             useXDR <- as.logical(value)
>>         })
>>     }
>>     if (is.na(port))
>>         stop("PORT must be specified")
>>     sinkWorkerOutput(outfile)
>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>     cat(msg)
>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>> }
>> <bytecode: 0x4bd4b58>
>> <environment: namespace:parallel>
>>
>> Yet, fix that does *not* seem to change anything.
>>
>> /Henrik
>>
>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>> <[hidden email]> wrote:
>>> BACKGROUND:
>>>
>>> While troubleshooting random, occasionally occurring, errors from
>>> parallel::makePSOCKcluster("localhost", port = 11000);
>>>
>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>> blocking = TRUE, :
>>>     cannot open the connection
>>>
>>> I had another look at parallel:::newPSOCKnode(), which is used
>>> internally to set up each background worker.  It is designed to, first
>>> launch the background worker as:
>>>
>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>>> wait = FALSE)
>>>
>>> which immediately tries to connect to a socket on localhost:11000 with
>>> timeout.  Immediately after the master launched the above (without
>>> waiting), it will set up the connection waiting for the connect from
>>> the background worker:
>>>
>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>>
>>>
>>> ISSUE:
>>>
>>> If we emulate the above process, and remove the OUT=/dev/null such
>>> that we can see the output produces by the worker, as:
>>>
>>> setup <- function(delay = 0) {
>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>>   Sys.sleep(delay)
>>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>>> = TRUE, open = "a+b", timeout = 20)
>>> }
>>>
>>> doing:
>>>
>>>> con <- setup(0)
>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>
>>> will most likely work, but adding a delay:
>>>
>>>> con <- setup(5)
>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>> = "a+b",  :
>>>   localhost:11000 cannot be opened
>>> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b",  :
>>>   cannot open the connection
>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>> socketConnection
>>>
>>> will produce an *instant* error on the worker, and before master opens
>>> the server socket.  Eventually, master will produce the originally
>>> observed error:
>>>
>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>> blocking = TRUE,  :
>>>   cannot open the connection
>>>
>>> In other words, if the master fails to setup socketConnection()
>>> *before* the background workers attempts to connect, it all fails.
>>> Such a delay may happen for instance when there is a large CPU load on
>>> the test machine.
>>>
>>> Is this the above bug?
>>>
>>> /Henrik
>>>
>>> PS. The background is that I, very occasionally, observe R CMD check
>>> error on the above (on CRAN and elsewhere) when testing my future
>>> package. The error always go away when retested. This far I've though
>>> this is due to port clashes (since the port is random selected in
>>> [11000:11999]) and accepted that it happens.  However, after
>>> discovering the above, it could be due to the worker launching "too
>>> soon".
>>>
>>>> sessionInfo()
>>> R version 3.4.3 (2017-11-30)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 16.04.4 LTS
>>>
>>> Matrix products: default
>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] compiler_3.4.3
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

Henrik Bengtsson-5
Great.

For the record of this thread, I've submitted patch PR17391
(https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391).  I've
patched it against the latest R-devel on the SVN, passes 'make
check-all', and I've verified it works with the above tests.

/Henrik

On Fri, Mar 9, 2018 at 4:37 AM,  <[hidden email]> wrote:

> I'm happy to look at a patch that does this.  I'd start with a small
> interval and increase it by 50%, say, on each try wit a max retry time
> limit. This isn't eliminating the problem,only reducing the
> probability, but still worth it. I had considered doing something like
> this but it didn't seem necessary at the time. You don't want to retry
> indefinitely since the connection could be failing because the master
> died, and then you want the workers to die as well.
>
> Best,
>
> luke
>
>
> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>
>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>> multiple times before failing, e.g.
>>
>>    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
>> = 10L, interval = 1.0) {
>>        port <- as.integer(port)
>>        for (i in seq_len(maxTries)) {
>>          con <- tryCatch({
>>            socketConnection(master, port = port, blocking = TRUE,
>>                                    open = "a+b", timeout = timeout)
>>          }, error = identity)
>>          if (inherits(con, "connection")) break
>>          Sys.sleep(interval)
>>        }
>>        if (inherits(con, "error")) stop(con)
>>        structure(list(con = con), class = if (useXDR)
>>            "SOCKnode"
>>        else "SOCK0node")
>>    }
>>
>> One could set 'maxTries' and 'interval' via commandArgs() like what is
>> done for the other arguments.
>>
>> I'm happy to submit an SVN patch if R core thinks this is an
>> acceptable solution.
>>
>> /Henrik
>>
>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>> <[hidden email]> wrote:
>>>
>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>>> as.logical(value):
>>>
>>>> parallel:::.slaveRSOCK
>>>
>>> function ()
>>> {
>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>         port <- as.integer(port)
>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>             open = "a+b", timeout = timeout)
>>>         structure(list(con = con), class = if (useXDR)
>>>             "SOCKnode"
>>>         else "SOCK0node")
>>>     }
>>>     master <- "localhost"
>>>     port <- NA_integer_
>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>     methods <- TRUE
>>>     useXDR <- TRUE
>>>     for (a in commandArgs(TRUE)) {
>>>         pos <- regexpr("=", a)
>>>         name <- substr(a, 1L, pos - 1L)
>>>         value <- substr(a, pos + 1L, nchar(a))
>>>         switch(name, MASTER = {
>>>             master <- value
>>>         }, PORT = {
>>>             port <- value
>>>         }, OUT = {
>>>             outfile <- value
>>>         }, TIMEOUT = {
>>>             timeout <- value
>>>         }, XDR = {
>>>             useXDR <- as.logical(value)
>>>         })
>>>     }
>>>     if (is.na(port))
>>>         stop("PORT must be specified")
>>>     sinkWorkerOutput(outfile)
>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>     cat(msg)
>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>> }
>>> <bytecode: 0x4bd4b58>
>>> <environment: namespace:parallel>
>>>
>>> Yet, fix that does *not* seem to change anything.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>> <[hidden email]> wrote:
>>>>
>>>> BACKGROUND:
>>>>
>>>> While troubleshooting random, occasionally occurring, errors from
>>>> parallel::makePSOCKcluster("localhost", port = 11000);
>>>>
>>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>>> blocking = TRUE, :
>>>>     cannot open the connection
>>>>
>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>> internally to set up each background worker.  It is designed to, first
>>>> launch the background worker as:
>>>>
>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>>>> wait = FALSE)
>>>>
>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>> timeout.  Immediately after the master launched the above (without
>>>> waiting), it will set up the connection waiting for the connect from
>>>> the background worker:
>>>>
>>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>>>
>>>>
>>>> ISSUE:
>>>>
>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>> that we can see the output produces by the worker, as:
>>>>
>>>> setup <- function(delay = 0) {
>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>>>   Sys.sleep(delay)
>>>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>>>> = TRUE, open = "a+b", timeout = 20)
>>>> }
>>>>
>>>> doing:
>>>>
>>>>> con <- setup(0)
>>>>
>>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>
>>>> will most likely work, but adding a delay:
>>>>
>>>>> con <- setup(5)
>>>>
>>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>>> = "a+b",  :
>>>>   localhost:11000 cannot be opened
>>>> Error in socketConnection(master, port = port, blocking = TRUE, open =
>>>> "a+b",  :
>>>>   cannot open the connection
>>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>>> socketConnection
>>>>
>>>> will produce an *instant* error on the worker, and before master opens
>>>> the server socket.  Eventually, master will produce the originally
>>>> observed error:
>>>>
>>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>>> blocking = TRUE,  :
>>>>   cannot open the connection
>>>>
>>>> In other words, if the master fails to setup socketConnection()
>>>> *before* the background workers attempts to connect, it all fails.
>>>> Such a delay may happen for instance when there is a large CPU load on
>>>> the test machine.
>>>>
>>>> Is this the above bug?
>>>>
>>>> /Henrik
>>>>
>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>> error on the above (on CRAN and elsewhere) when testing my future
>>>> package. The error always go away when retested. This far I've though
>>>> this is due to port clashes (since the port is random selected in
>>>> [11000:11999]) and accepted that it happens.  However, after
>>>> discovering the above, it could be due to the worker launching "too
>>>> soon".
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 3.4.3 (2017-11-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.4 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_3.4.3
>>
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   [hidden email]
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)

luke-tierney
Thanks. Fix committed to R-devel in r74417.

Best,

luke

On Sat, 10 Mar 2018, Henrik Bengtsson wrote:

> Great.
>
> For the record of this thread, I've submitted patch PR17391
> (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391).  I've
> patched it against the latest R-devel on the SVN, passes 'make
> check-all', and I've verified it works with the above tests.
>
> /Henrik
>
> On Fri, Mar 9, 2018 at 4:37 AM,  <[hidden email]> wrote:
>> I'm happy to look at a patch that does this.  I'd start with a small
>> interval and increase it by 50%, say, on each try wit a max retry time
>> limit. This isn't eliminating the problem,only reducing the
>> probability, but still worth it. I had considered doing something like
>> this but it didn't seem necessary at the time. You don't want to retry
>> indefinitely since the connection could be failing because the master
>> died, and then you want the workers to die as well.
>>
>> Best,
>>
>> luke
>>
>>
>> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>>
>>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>>> multiple times before failing, e.g.
>>>
>>>    makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries
>>> = 10L, interval = 1.0) {
>>>        port <- as.integer(port)
>>>        for (i in seq_len(maxTries)) {
>>>          con <- tryCatch({
>>>            socketConnection(master, port = port, blocking = TRUE,
>>>                                    open = "a+b", timeout = timeout)
>>>          }, error = identity)
>>>          if (inherits(con, "connection")) break
>>>          Sys.sleep(interval)
>>>        }
>>>        if (inherits(con, "error")) stop(con)
>>>        structure(list(con = con), class = if (useXDR)
>>>            "SOCKnode"
>>>        else "SOCK0node")
>>>    }
>>>
>>> One could set 'maxTries' and 'interval' via commandArgs() like what is
>>> done for the other arguments.
>>>
>>> I'm happy to submit an SVN patch if R core thinks this is an
>>> acceptable solution.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>>> <[hidden email]> wrote:
>>>>
>>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>>> as.integer(timeout), cf. port <- as.integer(port) and useXDR <-
>>>> as.logical(value):
>>>>
>>>>> parallel:::.slaveRSOCK
>>>>
>>>> function ()
>>>> {
>>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>>         port <- as.integer(port)
>>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>>             open = "a+b", timeout = timeout)
>>>>         structure(list(con = con), class = if (useXDR)
>>>>             "SOCKnode"
>>>>         else "SOCK0node")
>>>>     }
>>>>     master <- "localhost"
>>>>     port <- NA_integer_
>>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>>     methods <- TRUE
>>>>     useXDR <- TRUE
>>>>     for (a in commandArgs(TRUE)) {
>>>>         pos <- regexpr("=", a)
>>>>         name <- substr(a, 1L, pos - 1L)
>>>>         value <- substr(a, pos + 1L, nchar(a))
>>>>         switch(name, MASTER = {
>>>>             master <- value
>>>>         }, PORT = {
>>>>             port <- value
>>>>         }, OUT = {
>>>>             outfile <- value
>>>>         }, TIMEOUT = {
>>>>             timeout <- value
>>>>         }, XDR = {
>>>>             useXDR <- as.logical(value)
>>>>         })
>>>>     }
>>>>     if (is.na(port))
>>>>         stop("PORT must be specified")
>>>>     sinkWorkerOutput(outfile)
>>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>>     cat(msg)
>>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>>> }
>>>> <bytecode: 0x4bd4b58>
>>>> <environment: namespace:parallel>
>>>>
>>>> Yet, fix that does *not* seem to change anything.
>>>>
>>>> /Henrik
>>>>
>>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>>> <[hidden email]> wrote:
>>>>>
>>>>> BACKGROUND:
>>>>>
>>>>> While troubleshooting random, occasionally occurring, errors from
>>>>> parallel::makePSOCKcluster("localhost", port = 11000);
>>>>>
>>>>> Error in socketConnection("localhost", port = port, server = TRUE,
>>>>> blocking = TRUE, :
>>>>>     cannot open the connection
>>>>>
>>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>>> internally to set up each background worker.  It is designed to, first
>>>>> launch the background worker as:
>>>>>
>>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>>> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>>>>> wait = FALSE)
>>>>>
>>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>>> timeout.  Immediately after the master launched the above (without
>>>>> waiting), it will set up the connection waiting for the connect from
>>>>> the background worker:
>>>>>
>>>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>>>         blocking = TRUE, open = "a+b", timeout = timeout)
>>>>>
>>>>>
>>>>> ISSUE:
>>>>>
>>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>>> that we can see the output produces by the worker, as:
>>>>>
>>>>> setup <- function(delay = 0) {
>>>>>   system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args
>>>>> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE)
>>>>>   Sys.sleep(delay)
>>>>>   socketConnection("localhost", port = 11000, server = TRUE, blocking
>>>>> = TRUE, open = "a+b", timeout = 20)
>>>>> }
>>>>>
>>>>> doing:
>>>>>
>>>>>> con <- setup(0)
>>>>>
>>>>> starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>>
>>>>> will most likely work, but adding a delay:
>>>>>
>>>>>> con <- setup(5)
>>>>>
>>>>> starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>>> Warning in socketConnection(master, port = port, blocking = TRUE, open
>>>>> = "a+b",  :
>>>>>   localhost:11000 cannot be opened
>>>>> Error in socketConnection(master, port = port, blocking = TRUE, open =
>>>>> "a+b",  :
>>>>>   cannot open the connection
>>>>> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster ->
>>>>> socketConnection
>>>>>
>>>>> will produce an *instant* error on the worker, and before master opens
>>>>> the server socket.  Eventually, master will produce the originally
>>>>> observed error:
>>>>>
>>>>> Error in socketConnection("localhost", port = 11000, server = TRUE,
>>>>> blocking = TRUE,  :
>>>>>   cannot open the connection
>>>>>
>>>>> In other words, if the master fails to setup socketConnection()
>>>>> *before* the background workers attempts to connect, it all fails.
>>>>> Such a delay may happen for instance when there is a large CPU load on
>>>>> the test machine.
>>>>>
>>>>> Is this the above bug?
>>>>>
>>>>> /Henrik
>>>>>
>>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>>> error on the above (on CRAN and elsewhere) when testing my future
>>>>> package. The error always go away when retested. This far I've though
>>>>> this is due to port clashes (since the port is random selected in
>>>>> [11000:11999]) and accepted that it happens.  However, after
>>>>> discovering the above, it could be due to the worker launching "too
>>>>> soon".
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 3.4.3 (2017-11-30)
>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>> Running under: Ubuntu 16.04.4 LTS
>>>>>
>>>>> Matrix products: default
>>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>>
>>>>> locale:
>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] compiler_3.4.3
>>>
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> --
>> Luke Tierney
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>    Actuarial Science
>> 241 Schaeffer Hall                  email:   [hidden email]
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel