|
I just read an article about the new plyr package using parallelization to
speed up its performance. Just throwing out an idea: data.table could parallelize some operations and make use of multiple processors simultaneously. I don't think this is a must-have feature at the moment, though.
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
|
Branson,
You can use Simon Urbanek's multicore to parallelize by hand. From some tests I've done, the work you do on each grouping has to be pretty slow for this to pay off. If you have lots of groupings, the serial method will generally win because it doesn't have the overhead.

You can use "parallel" and "collect" from multicore with a data.table with something like:

    dt[, parallel(mean(b)), by = "a"]
    ans <- collect()

See below for examples and some timings. It doesn't work for me on Windows XP, but it does on Linux.

> library(multicore)
> n <- 1e8
> dt <- data.table(a = sample(1:10, n, replace = TRUE),
+                  b = sample(1:100, n, replace = TRUE),
+                  c = LETTERS[rep(1:500, n/500)], key = "a")
>
> (res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"])
       a   pid
 [1,]  1 19547
 [2,]  2 19548
 [3,]  3 19549
 [4,]  4 19550
 [5,]  5 19551
 [6,]  6 19552
 [7,]  7 19553
 [8,]  8 19554
 [9,]  9 19555
[10,] 10 19556
> (ans <- collect())
$`19556`
[1] 50.50949

$`19555`
[1] 50.49289

$`19554`
[1] 50.48453

$`19553`
[1] 50.48849

$`19552`
[1] 50.51581

$`19551`
[1] 50.49477

$`19550`
[1] 50.50468

$`19549`
[1] 50.50396

$`19548`
[1] 50.495

$`19547`
[1] 50.51994

$`19545`
[1] 50.51657

We're not done yet, because "ans" is a list keyed by process ID, and we need to merge res and ans on pid to get the results lined up with the groups. I won't bother with that.
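For what it's worth, the merge skipped above could be sketched roughly as follows. This is only an illustration, not something from the original tests. It uses a plain data.frame and a hand-built list as toy stand-ins for res and ans, filled with three of the values printed above plus the entry for pid 19545, which has no matching row in res. The column name mean_b is made up for the example. Matching on the pid names rather than on position seems the safer choice, since collect() returned the results in a different order from res.

```r
# Toy stand-ins for `res` and `ans`, using values from the transcript above.
# (Assumption: a data.frame is used here so the sketch runs without data.table.)
res <- data.frame(a = 1:3, pid = c(19547L, 19548L, 19549L))
ans <- list(`19549` = 50.50396, `19548` = 50.495,
            `19547` = 50.51994, `19545` = 50.51657)

# The names of `ans` are the pids as strings, so look each group's pid up
# by name. Entries of `ans` with no row in `res` are simply never selected.
idx <- match(as.character(res$pid), names(ans))

# `mean_b` is a hypothetical column name for the per-group result.
res$mean_b <- unlist(ans[idx], use.names = FALSE)
res
```

This assumes every job returned a single number. A job that failed or returned NULL would make unlist() drop that element, so the lengths should be checked before assigning on real data.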
Here are some timings:

> system.time({
+     res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
  2.880   3.996   5.561
>
> system.time({
+     dt[, mean(b), by = "a"]
+ })
   user  system elapsed
  3.051   2.605   5.660

No gain there, so let's make R work harder on each grouping:

> system.time({
+     res <- dt[, list(pid = parallel(mean(sort(b)))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
 17.416   5.138   8.114
>
> system.time({
+     dt[, mean(sort(b)), by = "a"]
+ })
   user  system elapsed
 11.429   2.682  14.120

- Tom

On Mon, Sep 13, 2010 at 10:36 AM, Branson Owen <[hidden email]> wrote:
> I just read an article about new plyr package using parallelization to
> speed up its performance. Just throw out an idea for data.table to
> parallelize some operations and make use of multiple processors
> simultaneously. I don't think this is a must-have feature at this
> moment though.
|
Wow, this is a super clear example and explanation. Much appreciated,
Tom! These are more than enough keywords for anyone searching for something similar. I will try your recipe on some of my heavy computations. Thank you very much once again!
