|
How do I do the equivalent to the following?
with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1), above as (select a, b from dt where b > .5), below as (select a, b from dt where b < .5) select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a; a | count ---+------- 3 | 0 2 | 1 (2 rows) How do I accomplish the same thing with data.tables? This is what I have so far: DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) above = DT[DT$b > .5] below = DT[DT$b < .5, list(a=a)] key(below) = 'a' below[above, list(count=length(a)), by=a] but this gives me: a count [1,] 2 1 [2,] NA 1 Thanks in advance for any tips. -- Yang Zhang http://yz.mit.edu/ _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Yang,
Since you also asked on SO, suggest we answer there (after your edit please) : http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r Matthew "Yang Zhang" <[hidden email]> wrote in message news:[hidden email]... > How do I do the equivalent to the following? > > with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 > union select 2, 1 union select 3, 1 union select 3, 1), > above as (select a, b from dt where b > .5), > below as (select a, b from dt where b < .5) > select above.a, count(below.a) from above left outer join below on > (above.a = below.a) group by above.a; > a | count > ---+------- > 3 | 0 > 2 | 1 > (2 rows) > > How do I accomplish the same thing with data.tables? This is what I > have so far: > > DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) > above = DT[DT$b > .5] > below = DT[DT$b < .5, list(a=a)] > key(below) = 'a' > below[above, list(count=length(a)), by=a] > > but this gives me: > > a count > [1,] 2 1 > [2,] NA 1 > > Thanks in advance for any tips. > > -- > Yang Zhang > http://yz.mit.edu/ _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Thanks, edited the question.
On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: > Yang, > Since you also asked on SO, suggest we answer there (after your edit please) > : > http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r > Matthew > > > "Yang Zhang" <[hidden email]> wrote in message > news:[hidden email]... >> How do I do the equivalent to the following? >> >> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >> union select 2, 1 union select 3, 1 union select 3, 1), >> above as (select a, b from dt where b > .5), >> below as (select a, b from dt where b < .5) >> select above.a, count(below.a) from above left outer join below on >> (above.a = below.a) group by above.a; >> a | count >> ---+------- >> 3 | 0 >> 2 | 1 >> (2 rows) >> >> How do I accomplish the same thing with data.tables? This is what I >> have so far: >> >> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >> above = DT[DT$b > .5] >> below = DT[DT$b < .5, list(a=a)] >> key(below) = 'a' >> below[above, list(count=length(a)), by=a] >> >> but this gives me: >> >> a count >> [1,] 2 1 >> [2,] NA 1 >> >> Thanks in advance for any tips. >> >> -- >> Yang Zhang >> http://yz.mit.edu/ > > > > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Yang Zhang http://yz.mit.edu/ _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
I'm going to continue here since the question is a bit more
complicated and SO isn't the best forum for back-and-forth. If I'm trying to do a join where I'm trying to aggregate counts (including 0s for nomatches), is there something more concise than the following, which is what I'm currently using since it works? # assume dt is a data.frame(user_id=..., age=...) y = dt[, list(count=length(age)), by=user_id] key(y) = 'user_id' y = y[J(unique(x$user_id))] y$count[is.na(y$count)] = 0 I tried: > key(y) = 'user_id' > y = y[J(unique(x$user_id)), list(count=length(age))] > summary(y$count) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 1.00 1.00 75.55 5.00 127200.00 > dim(y) [1] 7655 2 which gives me the right number of output rows but none of the lengths are 0, presumably because length(NA) == 1. (There are definitely users in x that are not in y.) But then when I tried (and there are no NAs in y$age): > count = function(x) if (any(is.na(x))) integer(0) else length(x) > key(y) = 'user_id' > y = y[J(unique(x$user_id)), list(count=count(age))] > summary(y$count) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.0 2.0 6.0 160.4 21.0 127200.0 > dim(y) [1] 3581 2 Rows seem to be disappearing, and still the min is 1. At this point I'm pretty disoriented. Any explanation? Thanks in advance. On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote: > Thanks, edited the question. > > On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: >> Yang, >> Since you also asked on SO, suggest we answer there (after your edit please) >> : >> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r >> Matthew >> >> >> "Yang Zhang" <[hidden email]> wrote in message >> news:[hidden email]... >>> How do I do the equivalent to the following? >>> >>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >>> union select 2, 1 union select 3, 1 union select 3, 1), >>> above as (select a, b from dt where b > .5), >>> below as (select a, b from dt where b < .5) >>> select above.a, count(below.a) from above left outer join below on >>> (above.a = below.a) group by above.a; >>> a | count >>> ---+------- >>> 3 | 0 >>> 2 | 1 >>> (2 rows) >>> >>> How do I accomplish the same thing with data.tables? This is what I >>> have so far: >>> >>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >>> above = DT[DT$b > .5] >>> below = DT[DT$b < .5, list(a=a)] >>> key(below) = 'a' >>> below[above, list(count=length(a)), by=a] >>> >>> but this gives me: >>> >>> a count >>> [1,] 2 1 >>> [2,] NA 1 >>> >>> Thanks in advance for any tips. >>> >>> -- >>> Yang Zhang >>> http://yz.mit.edu/ >> >> >> >> _______________________________________________ >> datatable-help mailing list >> [hidden email] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > > > -- > Yang Zhang > http://yz.mit.edu/ > -- Yang Zhang http://yz.mit.edu/ _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
The docs say:
"When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out." Some additional quick questions: 1. *Which* columns of i are used in the join (assuming no keys are set)? Is it just left-to-right? 2. When there are two columns with the same name in x and i (which aren't being used as join keys), is just the one from x kept? On Wed, Aug 17, 2011 at 6:26 PM, Yang Zhang <[hidden email]> wrote: > I'm going to continue here since the question is a bit more > complicated and SO isn't the best forum for back-and-forth. > > If I'm trying to do a join where I'm trying to aggregate counts > (including 0s for nomatches), is there something more concise than the > following, which is what I'm currently using since it works? > > # assume dt is a data.frame(user_id=..., age=...) > y = dt[, list(count=length(age)), by=user_id] > key(y) = 'user_id' > y = y[J(unique(x$user_id))] > y$count[is.na(y$count)] = 0 > > I tried: > >> key(y) = 'user_id' >> y = y[J(unique(x$user_id)), list(count=length(age))] >> summary(y$count) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 1.00 1.00 1.00 75.55 5.00 127200.00 >> dim(y) > [1] 7655 2 > > which gives me the right number of output rows but none of the lengths > are 0, presumably because length(NA) == 1. (There are definitely users > in x that are not in y.) > > But then when I tried (and there are no NAs in y$age): > >> count = function(x) if (any(is.na(x))) integer(0) else length(x) >> key(y) = 'user_id' >> y = y[J(unique(x$user_id)), list(count=count(age))] >> summary(y$count) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 1.0 2.0 6.0 160.4 21.0 127200.0 >> dim(y) > [1] 3581 2 > > Rows seem to be disappearing, and still the min is 1. > > At this point I'm pretty disoriented. Any explanation? Thanks in advance. > > > On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote: >> Thanks, edited the question. >> >> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: >>> Yang, >>> Since you also asked on SO, suggest we answer there (after your edit please) >>> : >>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r >>> Matthew >>> >>> >>> "Yang Zhang" <[hidden email]> wrote in message >>> news:[hidden email]... >>>> How do I do the equivalent to the following? >>>> >>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 >>>> union select 2, 1 union select 3, 1 union select 3, 1), >>>> above as (select a, b from dt where b > .5), >>>> below as (select a, b from dt where b < .5) >>>> select above.a, count(below.a) from above left outer join below on >>>> (above.a = below.a) group by above.a; >>>> a | count >>>> ---+------- >>>> 3 | 0 >>>> 2 | 1 >>>> (2 rows) >>>> >>>> How do I accomplish the same thing with data.tables? This is what I >>>> have so far: >>>> >>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) >>>> above = DT[DT$b > .5] >>>> below = DT[DT$b < .5, list(a=a)] >>>> key(below) = 'a' >>>> below[above, list(count=length(a)), by=a] >>>> >>>> but this gives me: >>>> >>>> a count >>>> [1,] 2 1 >>>> [2,] NA 1 >>>> >>>> Thanks in advance for any tips. >>>> >>>> -- >>>> Yang Zhang >>>> http://yz.mit.edu/ >>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> [hidden email] >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> >> >> -- >> Yang Zhang >> http://yz.mit.edu/ >> > > > > -- > Yang Zhang > http://yz.mit.edu/ > -- Yang Zhang http://yz.mit.edu/ _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On the previous post, the lack of example data makes it more time
consuming to help. Please provide a way to generate an example dt. Answers to 2nd two questions inline below ... On Wed, 2011-08-17 at 18:27 -0700, Yang Zhang wrote: > The docs say: > > "When i is a data.table, x must have a key. i is joined to x using the > key and the rows in x that match are returned. An equi-join is > performed between each column in i to each column in x's key. The > match is a binary search in compiled C in O(log n) time. If i has less > columns than x's key then many rows of x may match to each row of i. > If i has more columns than x's key, the columns of i not involved in > the join are included in the result. If i also has a key, it is i's > key columns that are used to match to x's key columns and a binary > merge of the two tables is carried out." > > Some additional quick questions: > > 1. *Which* columns of i are used in the join (assuming no keys are > set)? Is it just left-to-right? Yes, when i has no key, it's left-to-right. > 2. When there are two columns with the same name in x and i (which > aren't being used as join keys), is just the one from x kept? Yes, good question. There is a feature request to provide a "x." and "i." prefix to cope better with this situation; e.g., DT[X,age:=i.date-x.date,roll=TRUE] > > > On Wed, Aug 17, 2011 at 6:26 PM, Yang Zhang <[hidden email]> wrote: > > I'm going to continue here since the question is a bit more > > complicated and SO isn't the best forum for back-and-forth. > > > > If I'm trying to do a join where I'm trying to aggregate counts > > (including 0s for nomatches), is there something more concise than the > > following, which is what I'm currently using since it works? > > > > # assume dt is a data.frame(user_id=..., age=...) > > y = dt[, list(count=length(age)), by=user_id] > > key(y) = 'user_id' > > y = y[J(unique(x$user_id))] > > y$count[is.na(y$count)] = 0 > > > > I tried: > > > >> key(y) = 'user_id' > >> y = y[J(unique(x$user_id)), list(count=length(age))] > >> summary(y$count) > > Min. 1st Qu. Median Mean 3rd Qu. Max. > > 1.00 1.00 1.00 75.55 5.00 127200.00 > >> dim(y) > > [1] 7655 2 > > > > which gives me the right number of output rows but none of the lengths > > are 0, presumably because length(NA) == 1. (There are definitely users > > in x that are not in y.) > > > > But then when I tried (and there are no NAs in y$age): > > > >> count = function(x) if (any(is.na(x))) integer(0) else length(x) > >> key(y) = 'user_id' > >> y = y[J(unique(x$user_id)), list(count=count(age))] > >> summary(y$count) > > Min. 1st Qu. Median Mean 3rd Qu. Max. > > 1.0 2.0 6.0 160.4 21.0 127200.0 > >> dim(y) > > [1] 3581 2 > > > > Rows seem to be disappearing, and still the min is 1. > > > > At this point I'm pretty disoriented. Any explanation? Thanks in advance. > > > > > > On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote: > >> Thanks, edited the question. > >> > >> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote: > >>> Yang, > >>> Since you also asked on SO, suggest we answer there (after your edit please) > >>> : > >>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r > >>> Matthew > >>> > >>> > >>> "Yang Zhang" <[hidden email]> wrote in message > >>> news:[hidden email]... > >>>> How do I do the equivalent to the following? > >>>> > >>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 > >>>> union select 2, 1 union select 3, 1 union select 3, 1), > >>>> above as (select a, b from dt where b > .5), > >>>> below as (select a, b from dt where b < .5) > >>>> select above.a, count(below.a) from above left outer join below on > >>>> (above.a = below.a) group by above.a; > >>>> a | count > >>>> ---+------- > >>>> 3 | 0 > >>>> 2 | 1 > >>>> (2 rows) > >>>> > >>>> How do I accomplish the same thing with data.tables? This is what I > >>>> have so far: > >>>> > >>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1)) > >>>> above = DT[DT$b > .5] > >>>> below = DT[DT$b < .5, list(a=a)] > >>>> key(below) = 'a' > >>>> below[above, list(count=length(a)), by=a] > >>>> > >>>> but this gives me: > >>>> > >>>> a count > >>>> [1,] 2 1 > >>>> [2,] NA 1 > >>>> > >>>> Thanks in advance for any tips. > >>>> > >>>> -- > >>>> Yang Zhang > >>>> http://yz.mit.edu/ > >>> > >>> > >>> > >>> _______________________________________________ > >>> datatable-help mailing list > >>> [hidden email] > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >>> > >> > >> > >> > >> -- > >> Yang Zhang > >> http://yz.mit.edu/ > >> > > > > > > > > -- > > Yang Zhang > > http://yz.mit.edu/ > > > > > _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
| Powered by Nabble | Edit this page |
