# Basic join question

6 messages
Open this post in threaded view
|
Report Content as Inappropriate

## Basic join question

 How do I do the equivalent to the following? with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),   above as (select a, b from dt where b > .5),   below as (select a, b from dt where b < .5) select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;  a | count ---+-------  3 |     0  2 |     1 (2 rows) How do I accomplish the same thing with data.tables?  This is what I have so far: DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) above = DT[DT\$b > .5] below = DT[DT\$b < .5, list(a=a)] key(below) = 'a' below[above, list(count=length(a)), by=a] but this gives me:       a count [1,]  2 1 [2,] NA 1 Thanks in advance for any tips. -- Yang Zhang http://yz.mit.edu/_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Basic join question

 Yang, Since you also asked on SO, suggest we answer there (after your edit please) : http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-rMatthew "Yang Zhang" <[hidden email]> wrote in message news:[hidden email]... > How do I do the equivalent to the following? > > with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 > union select 2, 1 union select 3, 1 union select 3, 1), >  above as (select a, b from dt where b > .5), >  below as (select a, b from dt where b < .5) > select above.a, count(below.a) from above left outer join below on > (above.a = below.a) group by above.a; > a | count > ---+------- > 3 |     0 > 2 |     1 > (2 rows) > > How do I accomplish the same thing with data.tables?  This is what I > have so far: > > DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) > above = DT[DT\$b > .5] > below = DT[DT\$b < .5, list(a=a)] > key(below) = 'a' > below[above, list(count=length(a)), by=a] > > but this gives me: > >      a count > [1,]  2 1 > [2,] NA 1 > > Thanks in advance for any tips. > > -- > Yang Zhang > http://yz.mit.edu/  _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Basic join question

 Thanks, edited the question. On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: > Yang, > Since you also asked on SO, suggest we answer there (after your edit please) > : > http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r> Matthew > > > "Yang Zhang" <[hidden email]> wrote in message > news:[hidden email]... >> How do I do the equivalent to the following? >> >> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >> union select 2, 1 union select 3, 1 union select 3, 1), >>  above as (select a, b from dt where b > .5), >>  below as (select a, b from dt where b < .5) >> select above.a, count(below.a) from above left outer join below on >> (above.a = below.a) group by above.a; >> a | count >> ---+------- >> 3 |     0 >> 2 |     1 >> (2 rows) >> >> How do I accomplish the same thing with data.tables?  This is what I >> have so far: >> >> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >> above = DT[DT\$b > .5] >> below = DT[DT\$b < .5, list(a=a)] >> key(below) = 'a' >> below[above, list(count=length(a)), by=a] >> >> but this gives me: >> >>      a count >> [1,]  2 1 >> [2,] NA 1 >> >> Thanks in advance for any tips. >> >> -- >> Yang Zhang >> http://yz.mit.edu/> > > > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help> -- Yang Zhang http://yz.mit.edu/_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Basic join question

 I'm going to continue here since the question is a bit more complicated and SO isn't the best forum for back-and-forth. If I'm trying to do a join where I'm trying to aggregate counts (including 0s for nomatches), is there something more concise than the following, which is what I'm currently using since it works? # assume dt is a data.frame(user_id=..., age=...) y = dt[, list(count=length(age)), by=user_id] key(y) = 'user_id' y = y[J(unique(x\$user_id))] y\$count[is.na(y\$count)] = 0 I tried: > key(y) = 'user_id' > y = y[J(unique(x\$user_id)), list(count=length(age))] > summary(y\$count)      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      1.00      1.00      1.00     75.55      5.00 127200.00 > dim(y) [1] 7655    2 which gives me the right number of output rows but none of the lengths are 0, presumably because length(NA) == 1. (There are definitely users in x that are not in y.) But then when I tried (and there are no NAs in y\$age): > count = function(x) if (any(is.na(x))) integer(0) else length(x) > key(y) = 'user_id' > y = y[J(unique(x\$user_id)), list(count=count(age))] > summary(y\$count)     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.      1.0      2.0      6.0    160.4     21.0 127200.0 > dim(y) [1] 3581    2 Rows seem to be disappearing, and still the min is 1. At this point I'm pretty disoriented. Any explanation? Thanks in advance. On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote: > Thanks, edited the question. > > On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: >> Yang, >> Since you also asked on SO, suggest we answer there (after your edit please) >> : >> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r>> Matthew >> >> >> "Yang Zhang" <[hidden email]> wrote in message >> news:[hidden email]... >>> How do I do the equivalent to the following? >>> >>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >>> union select 2, 1 union select 3, 1 union select 3, 1), >>>  above as (select a, b from dt where b > .5), >>>  below as (select a, b from dt where b < .5) >>> select above.a, count(below.a) from above left outer join below on >>> (above.a = below.a) group by above.a; >>> a | count >>> ---+------- >>> 3 |     0 >>> 2 |     1 >>> (2 rows) >>> >>> How do I accomplish the same thing with data.tables?  This is what I >>> have so far: >>> >>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >>> above = DT[DT\$b > .5] >>> below = DT[DT\$b < .5, list(a=a)] >>> key(below) = 'a' >>> below[above, list(count=length(a)), by=a] >>> >>> but this gives me: >>> >>>      a count >>> [1,]  2 1 >>> [2,] NA 1 >>> >>> Thanks in advance for any tips. >>> >>> -- >>> Yang Zhang >>> http://yz.mit.edu/>> >> >> >> _______________________________________________ >> datatable-help mailing list >> [hidden email] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help>> > > > > -- > Yang Zhang > http://yz.mit.edu/> -- Yang Zhang http://yz.mit.edu/_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Basic join question

 The docs say: "When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out." Some additional quick questions: 1. *Which* columns of i are used in the join (assuming no keys are set)? Is it just left-to-right? 2. When there are two columns with the same name in x and i (which aren't being used as join keys), is just the one from x kept? On Wed, Aug 17, 2011 at 6:26 PM, Yang Zhang <[hidden email]> wrote: > I'm going to continue here since the question is a bit more > complicated and SO isn't the best forum for back-and-forth. > > If I'm trying to do a join where I'm trying to aggregate counts > (including 0s for nomatches), is there something more concise than the > following, which is what I'm currently using since it works? > > # assume dt is a data.frame(user_id=..., age=...) > y = dt[, list(count=length(age)), by=user_id] > key(y) = 'user_id' > y = y[J(unique(x\$user_id))] > y\$count[is.na(y\$count)] = 0 > > I tried: > >> key(y) = 'user_id' >> y = y[J(unique(x\$user_id)), list(count=length(age))] >> summary(y\$count) >     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. >     1.00      1.00      1.00     75.55      5.00 127200.00 >> dim(y) > [1] 7655    2 > > which gives me the right number of output rows but none of the lengths > are 0, presumably because length(NA) == 1. (There are definitely users > in x that are not in y.) > > But then when I tried (and there are no NAs in y\$age): > >> count = function(x) if (any(is.na(x))) integer(0) else length(x) >> key(y) = 'user_id' >> y = y[J(unique(x\$user_id)), list(count=count(age))] >> summary(y\$count) >    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. >     1.0      2.0      6.0    160.4     21.0 127200.0 >> dim(y) > [1] 3581    2 > > Rows seem to be disappearing, and still the min is 1. > > At this point I'm pretty disoriented. Any explanation? Thanks in advance. > > > On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote: >> Thanks, edited the question. >> >> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: >>> Yang, >>> Since you also asked on SO, suggest we answer there (after your edit please) >>> : >>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r>>> Matthew >>> >>> >>> "Yang Zhang" <[hidden email]> wrote in message >>> news:[hidden email]... >>>> How do I do the equivalent to the following? >>>> >>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >>>> union select 2, 1 union select 3, 1 union select 3, 1), >>>>  above as (select a, b from dt where b > .5), >>>>  below as (select a, b from dt where b < .5) >>>> select above.a, count(below.a) from above left outer join below on >>>> (above.a = below.a) group by above.a; >>>> a | count >>>> ---+------- >>>> 3 |     0 >>>> 2 |     1 >>>> (2 rows) >>>> >>>> How do I accomplish the same thing with data.tables?  This is what I >>>> have so far: >>>> >>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >>>> above = DT[DT\$b > .5] >>>> below = DT[DT\$b < .5, list(a=a)] >>>> key(below) = 'a' >>>> below[above, list(count=length(a)), by=a] >>>> >>>> but this gives me: >>>> >>>>      a count >>>> [1,]  2 1 >>>> [2,] NA 1 >>>> >>>> Thanks in advance for any tips. >>>> >>>> -- >>>> Yang Zhang >>>> http://yz.mit.edu/>>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> [hidden email] >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help>>> >> >> >> >> -- >> Yang Zhang >> http://yz.mit.edu/>> > > > > -- > Yang Zhang > http://yz.mit.edu/> -- Yang Zhang http://yz.mit.edu/_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help