Parallelize data.table's operations?

Parallelize data.table's operations?

Branson Owen
I just read an article about the new plyr package using parallelization
to speed up its performance. I'm just throwing out the idea that
data.table could parallelize some operations to make use of multiple
processors simultaneously. I don't think this is a must-have feature at
the moment, though.
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: Parallelize data.table's operations?

Tom Short-2
Branson,

You can use Simon Urbanek's multicore package to parallelize by hand.
From some tests I've done, the work done on each group has to be
pretty slow for this to pay off. If you have lots of groups, the
serial method will generally win because it avoids the overhead of
starting a separate process for each one. You can use "parallel" and
"collect" from multicore with a data.table with something like:

dt[, parallel(mean(b)), by = "a"]
ans <- collect()

See below for examples and some timings. It doesn't work for
me on Windows XP, but it does on Linux.

> library(data.table)
> library(multicore)
> n <- 1e8
> dt <- data.table(a = sample(1:10, n, replace = TRUE),
+                  b = sample(1:100, n, replace = TRUE),
+                  c = LETTERS[rep(1:500, n/500)], key = "a")
>
> (res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"])
       a   pid
 [1,]  1 19547
 [2,]  2 19548
 [3,]  3 19549
 [4,]  4 19550
 [5,]  5 19551
 [6,]  6 19552
 [7,]  7 19553
 [8,]  8 19554
 [9,]  9 19555
[10,] 10 19556
> (ans <- collect())
$`19556`
[1] 50.50949

$`19555`
[1] 50.49289

$`19554`
[1] 50.48453

$`19553`
[1] 50.48849

$`19552`
[1] 50.51581

$`19551`
[1] 50.49477

$`19550`
[1] 50.50468

$`19549`
[1] 50.50396

$`19548`
[1] 50.495

$`19547`
[1] 50.51994

$`19545`
[1] 50.51657


We're not done yet, because "ans" is a list named by pid, and
we need to merge res and ans to match each result to its group.
I won't bother with that.
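
For completeness, here is one way the merge step could look. This is only a sketch: the small res and ans objects below are made-up stand-ins with the same shape as the ones above (a pid column, and a list named by pid in arbitrary order), and the column name mean_b is my own choice. The lookup is done by pid name rather than by position, so the order of "ans" doesn't matter and any extra entry (the output above shows 11 results for 10 groups) is simply ignored.

```r
library(data.table)

# Hypothetical stand-ins for the objects in the session above:
# res maps each group to the pid of the job that handled it,
# ans mimics what collect() returned: a list named by pid, unordered,
# with one extra entry that matches no group.
res <- data.table(a = 1:3, pid = c(101L, 102L, 103L))
ans <- list(`103` = 50.3, `101` = 50.1, `102` = 50.2, `99` = 50.0)

# Look up each row's result by its pid (as a character name), then
# flatten the resulting list into a plain numeric column.
res[, mean_b := unlist(ans[as.character(pid)], use.names = FALSE)]

print(res)
```

After this, res holds one row per group with its mean, which is the same shape that the serial dt[, mean(b), by = "a"] produces.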

Here are some timings:

> system.time({
+     res <- dt[, list(pid = parallel(mean(b))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
  2.880   3.996   5.561
>
> system.time({
+     dt[, mean(b), by = "a"]
+ })
   user  system elapsed
  3.051   2.605   5.660

No gain there, so let's make R work harder on each grouping:

> system.time({
+     res <- dt[, list(pid = parallel(mean(sort(b)))$pid), by = "a"]
+     ans <- collect()
+ })
   user  system elapsed
 17.416   5.138   8.114
>
> system.time({
+     dt[, mean(sort(b)), by = "a"]
+ })
   user  system elapsed
 11.429   2.682  14.120
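
For anyone trying this on a current R installation: the same two functions are available in R's bundled parallel package as mcparallel() and mccollect(). Below is a minimal sketch of the by-hand approach using those, assuming a Unix-like system (they rely on forking, which is not available on Windows). The toy data, the split() step, and the names groups, jobs and means are illustrative assumptions, not part of the tests above.

```r
library(parallel)

# Small made-up data set: a grouping column a and a value column b.
set.seed(1)
d <- data.frame(a = sample(1:4, 1000, replace = TRUE),
                b = sample(1:100, 1000, replace = TRUE))

# One vector of b values per group.
groups <- split(d$b, d$a)

# Fork one child process per group; each call returns a job handle
# immediately while the child computes the mean in the background.
jobs <- lapply(groups, function(x) mcparallel(mean(x)))

# Wait for all children. Results come back in the same order as jobs,
# named by pid, so the group names are re-attached explicitly.
out <- mccollect(jobs)
means <- setNames(unlist(out, use.names = FALSE), names(groups))

print(means)
```

With groups this small the forking overhead dominates, so this only shows the mechanics; as the timings above suggest, a gain needs slow per-group work.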

- Tom

Re: Parallelize data.table's operations?

Branson Owen
Wow, this is a super clear example and explanation. Much appreciated,
Tom! This gives more than enough keywords for anyone searching for a
similar need. I will try your recipe on some of my heavy computations.
Thank you very much once again!