Deep Replicable Bug With AMD Threadripper MultiCore

ivo welch-4
The following program is whittled down from a much larger program that
always works on Intel, and always works on AMD's Threadripper with
lapply but not with mclapply.  With mclapply on AMD, all processes go into
"suspend" mode and the program then hangs.  The bug is replicable on an
AMD Ryzen Threadripper 2950X 16-Core Processor (128GB RAM) running the
latest Ubuntu 18.04, with R version 3.5.3 (2019-03-11) -- "Great Truth",
invoked with --vanilla.  I hope this helps; it took quite a while to get
it to this stage.  I sure hope that I am not reporting an old bug...

options("mc.cores"=4)
library(data.table)
library(parallel)

if (!file.exists("bugsample.csv")) {
    NR <- 64833330
    notused <- data.frame(v1=1:NR, v2=1:NR, v3=1:NR, x1=log(1:NR),
                          x2=log(1:NR))
    fwrite(notused, file="bugsample.csv")
    stop("you can quit now and restart the program")
}

## needed!  Inf cannot be replaced by actual NR
if (!exists("notused")) notused <- fread("bugsample.csv", nrows= Inf)


sample <- data.frame( groupidentifier=c( rep(11111,2000), rep(22222, 4500 ) ) )
sample$yvar <- sin(1:nrow(sample))
sample$xvar <- 1:nrow(sample)


testfun <- function(dl) {
    with(dl, message("Working: ", first(groupidentifier), " with ",
                     nrow(dl)))

    lapply( 1:nrow(dl), FUN=function(onedayindex) {
        if ((onedayindex %% 500) != 0) return(NULL)
        with(dl[1:onedayindex,],
             c( tryCatch( coef(lm( yvar ~ xvar, data=dl[1:onedayindex,] ))[2],
                          error = function(e) NA ) ) )
    })
}


message("starting --- replicable hang with mclapply, but not lapply")

o <- mclapply(split( 1:nrow(sample), sample$groupidentifier ),
              FUN=function(.index) testfun( sample[.index, , drop=FALSE] ))

message("never gets here with mclapply")

print( do.call("c", o[[1]]) )
print( do.call("c", o[[2]]) )



--
Ivo Welch ([hidden email])

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Deep Replicable Bug With AMD Threadripper MultiCore

Dirk Eddelbuettel

On 4 April 2019 at 17:28, ivo welch wrote:
| options("mc.cores"=4)
| library(data.table)
| library(parallel)

Just as you set mc.cores to 4 for parallel::mclapply, I would try throttling
data.table, which in its current version goes for all cores. So do, say,

  setDTthreads(4)

and see if that helps. Try lower and lower values to see if you get by.
While there may well be a different race condition in mclapply, it may help
to not overschedule.

(FWIW, the next version of data.table, in queue at CRAN, is less aggressive
and has additional options for fine tuning.)
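
A minimal sketch of that throttling, assuming data.table is installed and a
Unix-alike system where mclapply() can fork. The toy task and the variable
names are illustrative and not part of the original program; the value 1 is
simply the low end of the "lower and lower" range suggested above.

```r
library(parallel)
library(data.table)

options(mc.cores = 2)             # forked workers used by mclapply()

threads_before <- getDTthreads()  # whatever data.table picked by default
setDTthreads(1)                   # most conservative setting; raise it step by step
threads_after <- getDTthreads()

message("data.table threads: ", threads_before, " -> ", threads_after)

# Hypothetical toy task standing in for the real per-group work.
squares <- mclapply(1:4, function(i) i^2)
print(unlist(squares))
```

If the hang disappears at a low thread count and returns at a higher one, that
points towards oversubscription or an OpenMP/fork interaction rather than the
regression code itself.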

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: Deep Replicable Bug With AMD Threadripper MultiCore

Tomas Kalibera

In addition, you can try to use a PSOCK cluster (see makeCluster,
parLapply) to avoid the problem; it should help if the problem is
somehow related to forking in mclapply().
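
A minimal sketch of the PSOCK route, using only packages shipped with R. The
data frame is renamed to sample_df (an illustrative name, chosen to avoid
masking base::sample), and testfun is a simplified stand-in for the function
in the original post, not a drop-in replacement.

```r
library(parallel)

sample_df <- data.frame(groupidentifier = c(rep(11111, 2000), rep(22222, 4500)))
sample_df$yvar <- sin(seq_len(nrow(sample_df)))
sample_df$xvar <- seq_len(nrow(sample_df))

# Slope of yvar ~ xvar on an expanding window, evaluated at every 500th row.
testfun <- function(dl) {
  ends <- seq_len(nrow(dl))
  ends <- ends[ends %% 500 == 0]
  vapply(ends, function(i) {
    tryCatch(unname(coef(lm(yvar ~ xvar, data = dl[seq_len(i), ]))[2]),
             error = function(e) NA_real_)
  }, numeric(1))
}

groups <- split(sample_df, sample_df$groupidentifier)

# PSOCK workers are fresh R processes, so nothing is forked from this session.
cl <- makeCluster(2, type = "PSOCK")
o <- tryCatch(parLapply(cl, groups, testfun), finally = stopCluster(cl))

print(o[[1]])
print(o[[2]])
```

Because the workers start as fresh processes, anything they need beyond the
function and its arguments has to be sent explicitly (for example with
clusterExport), which is the main practical difference from mclapply().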

The problem you are seeing may be in base R, in data.table, or in the
interaction between the two (mclapply() from base R uses forking
directly, data.table uses OpenMP). If you think the bug is in base R, it
would be much better if you could find a reproducible example that
only uses packages shipped directly with R; otherwise it might be best to
contact the maintainer of data.table.
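
One way to start such an experiment is sketched below: the same mclapply()
call with data.table removed entirely. The assumptions are that
groupidentifier[1] stands in for data.table's first(), that the large fread()
step is simply dropped, and that the data frame is renamed sample_df. Whether
this variant still hangs on the affected machine is exactly what the
experiment would show; nothing here claims that it does.

```r
library(parallel)       # ships with R; data.table is deliberately not loaded
options(mc.cores = 2)

sample_df <- data.frame(groupidentifier = c(rep(11111, 2000), rep(22222, 4500)))
sample_df$yvar <- sin(seq_len(nrow(sample_df)))
sample_df$xvar <- seq_len(nrow(sample_df))

testfun <- function(dl) {
  # groupidentifier[1] replaces data.table::first(groupidentifier)
  message("Working: ", dl$groupidentifier[1], " with ", nrow(dl))
  lapply(seq_len(nrow(dl)), function(onedayindex) {
    if ((onedayindex %% 500) != 0) return(NULL)
    tryCatch(coef(lm(yvar ~ xvar, data = dl[1:onedayindex, ]))[2],
             error = function(e) NA)
  })
}

o <- mclapply(split(seq_len(nrow(sample_df)), sample_df$groupidentifier),
              function(.index) testfun(sample_df[.index, , drop = FALSE]))

slopes1 <- do.call("c", o[[1]])   # NULL entries are dropped by c()
slopes2 <- do.call("c", o[[2]])
print(slopes1)
print(slopes2)
```

If this version runs to completion while the original hangs, the data.table
maintainer is the more likely first contact.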

Please also make sure to use the latest version of R 3.5 (or R-devel).
The implementation of forking in the parallel package, and hence also in
mclapply, has been rewritten since R 3.4.
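
A small sketch for recording the relevant versions in a follow-up report. The
3.5.0 threshold is an assumption that reads "since R 3.4" as "in the 3.5
series", and the variable names are illustrative.

```r
r_version <- getRversion()
message("Running ", R.version.string)
message("parallel: ", as.character(packageVersion("parallel")))

# data.table is optional here; report its version only if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  message("data.table: ", as.character(packageVersion("data.table")))
}

# Assumed threshold: TRUE when the session is on R 3.5.0 or later.
has_new_forking <- r_version >= "3.5.0"
message("R >= 3.5.0: ", has_new_forking)
```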

Best
Tomas
