Quantcast

Basic join question

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Basic join question

Yang Zhang
How do I do the equivalent to the following?

with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
union select 2, 1 union select 3, 1 union select 3, 1),
  above as (select a, b from dt where b > .5),
  below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on
(above.a = below.a) group by above.a;
 a | count
---+-------
 3 |     0
 2 |     1
(2 rows)

How do I accomplish the same thing with data.tables?  This is what I
have so far:

DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]
key(below) = 'a'
below[above, list(count=length(a)), by=a]

but this gives me:

      a count
[1,]  2 1
[2,] NA 1

Thanks in advance for any tips.

--
Yang Zhang
http://yz.mit.edu/
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Basic join question

Matthew Dowle
Yang,
Since you also asked on SO, suggest we answer there (after your edit please)
:
http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
Matthew


"Yang Zhang" <[hidden email]> wrote in message
news:[hidden email]...

> How do I do the equivalent to the following?
>
> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
> union select 2, 1 union select 3, 1 union select 3, 1),
>  above as (select a, b from dt where b > .5),
>  below as (select a, b from dt where b < .5)
> select above.a, count(below.a) from above left outer join below on
> (above.a = below.a) group by above.a;
> a | count
> ---+-------
> 3 |     0
> 2 |     1
> (2 rows)
>
> How do I accomplish the same thing with data.tables?  This is what I
> have so far:
>
> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
> above = DT[DT$b > .5]
> below = DT[DT$b < .5, list(a=a)]
> key(below) = 'a'
> below[above, list(count=length(a)), by=a]
>
> but this gives me:
>
>      a count
> [1,]  2 1
> [2,] NA 1
>
> Thanks in advance for any tips.
>
> --
> Yang Zhang
> http://yz.mit.edu/ 



_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Basic join question

Yang Zhang
Thanks, edited the question.

On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote:

> Yang,
> Since you also asked on SO, suggest we answer there (after your edit please)
> :
> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
> Matthew
>
>
> "Yang Zhang" <[hidden email]> wrote in message
> news:[hidden email]...
>> How do I do the equivalent to the following?
>>
>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
>> union select 2, 1 union select 3, 1 union select 3, 1),
>>  above as (select a, b from dt where b > .5),
>>  below as (select a, b from dt where b < .5)
>> select above.a, count(below.a) from above left outer join below on
>> (above.a = below.a) group by above.a;
>> a | count
>> ---+-------
>> 3 |     0
>> 2 |     1
>> (2 rows)
>>
>> How do I accomplish the same thing with data.tables?  This is what I
>> have so far:
>>
>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
>> above = DT[DT$b > .5]
>> below = DT[DT$b < .5, list(a=a)]
>> key(below) = 'a'
>> below[above, list(count=length(a)), by=a]
>>
>> but this gives me:
>>
>>      a count
>> [1,]  2 1
>> [2,] NA 1
>>
>> Thanks in advance for any tips.
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>
>
>
> _______________________________________________
> datatable-help mailing list
> [hidden email]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>



--
Yang Zhang
http://yz.mit.edu/
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Basic join question

Yang Zhang
I'm going to continue here since the question is a bit more
complicated and SO isn't the best forum for back-and-forth.

If I'm trying to do a join where I'm trying to aggregate counts
(including 0s for nomatches), is there something more concise than the
following, which is what I'm currently using since it works?

# assume dt is a data.frame(user_id=..., age=...)
y = dt[, list(count=length(age)), by=user_id]
key(y) = 'user_id'
y = y[J(unique(x$user_id))]
y$count[is.na(y$count)] = 0

I tried:

> key(y) = 'user_id'
> y = y[J(unique(x$user_id)), list(count=length(age))]
> summary(y$count)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
     1.00      1.00      1.00     75.55      5.00 127200.00
> dim(y)
[1] 7655    2

which gives me the right number of output rows but none of the lengths
are 0, presumably because length(NA) == 1. (There are definitely users
in x that are not in y.)

But then when I tried (and there are no NAs in y$age):

> count = function(x) if (any(is.na(x))) integer(0) else length(x)
> key(y) = 'user_id'
> y = y[J(unique(x$user_id)), list(count=count(age))]
> summary(y$count)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.0      2.0      6.0    160.4     21.0 127200.0
> dim(y)
[1] 3581    2

Rows seem to be disappearing, and still the min is 1.

At this point I'm pretty disoriented. Any explanation? Thanks in advance.


On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote:

> Thanks, edited the question.
>
> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote:
>> Yang,
>> Since you also asked on SO, suggest we answer there (after your edit please)
>> :
>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
>> Matthew
>>
>>
>> "Yang Zhang" <[hidden email]> wrote in message
>> news:[hidden email]...
>>> How do I do the equivalent to the following?
>>>
>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
>>> union select 2, 1 union select 3, 1 union select 3, 1),
>>>  above as (select a, b from dt where b > .5),
>>>  below as (select a, b from dt where b < .5)
>>> select above.a, count(below.a) from above left outer join below on
>>> (above.a = below.a) group by above.a;
>>> a | count
>>> ---+-------
>>> 3 |     0
>>> 2 |     1
>>> (2 rows)
>>>
>>> How do I accomplish the same thing with data.tables?  This is what I
>>> have so far:
>>>
>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
>>> above = DT[DT$b > .5]
>>> below = DT[DT$b < .5, list(a=a)]
>>> key(below) = 'a'
>>> below[above, list(count=length(a)), by=a]
>>>
>>> but this gives me:
>>>
>>>      a count
>>> [1,]  2 1
>>> [2,] NA 1
>>>
>>> Thanks in advance for any tips.
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/
>>
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> [hidden email]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
>



--
Yang Zhang
http://yz.mit.edu/
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Basic join question

Yang Zhang
The docs say:

"When i is a data.table, x must have a key. i is joined to x using the
key and the rows in x that match are returned. An equi-join is
performed between each column in i to each column in x's key. The
match is a binary search in compiled C in O(log n) time. If i has less
columns than x's key then many rows of x may match to each row of i.
If i has more columns than x's key, the columns of i not involved in
the join are included in the result. If i also has a key, it is i's
key columns that are used to match to x's key columns and a binary
merge of the two tables is carried out."

Some additional quick questions:

1. *Which* columns of i are used in the join (assuming no keys are
set)? Is it just left-to-right?

2. When there are two columns with the same name in x and i (which
aren't being used as join keys), is just the one from x kept?


On Wed, Aug 17, 2011 at 6:26 PM, Yang Zhang <[hidden email]> wrote:

> I'm going to continue here since the question is a bit more
> complicated and SO isn't the best forum for back-and-forth.
>
> If I'm trying to do a join where I'm trying to aggregate counts
> (including 0s for nomatches), is there something more concise than the
> following, which is what I'm currently using since it works?
>
> # assume dt is a data.frame(user_id=..., age=...)
> y = dt[, list(count=length(age)), by=user_id]
> key(y) = 'user_id'
> y = y[J(unique(x$user_id))]
> y$count[is.na(y$count)] = 0
>
> I tried:
>
>> key(y) = 'user_id'
>> y = y[J(unique(x$user_id)), list(count=length(age))]
>> summary(y$count)
>     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
>     1.00      1.00      1.00     75.55      5.00 127200.00
>> dim(y)
> [1] 7655    2
>
> which gives me the right number of output rows but none of the lengths
> are 0, presumably because length(NA) == 1. (There are definitely users
> in x that are not in y.)
>
> But then when I tried (and there are no NAs in y$age):
>
>> count = function(x) if (any(is.na(x))) integer(0) else length(x)
>> key(y) = 'user_id'
>> y = y[J(unique(x$user_id)), list(count=count(age))]
>> summary(y$count)
>    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>     1.0      2.0      6.0    160.4     21.0 127200.0
>> dim(y)
> [1] 3581    2
>
> Rows seem to be disappearing, and still the min is 1.
>
> At this point I'm pretty disoriented. Any explanation? Thanks in advance.
>
>
> On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote:
>> Thanks, edited the question.
>>
>> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote:
>>> Yang,
>>> Since you also asked on SO, suggest we answer there (after your edit please)
>>> :
>>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
>>> Matthew
>>>
>>>
>>> "Yang Zhang" <[hidden email]> wrote in message
>>> news:[hidden email]...
>>>> How do I do the equivalent to the following?
>>>>
>>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
>>>> union select 2, 1 union select 3, 1 union select 3, 1),
>>>>  above as (select a, b from dt where b > .5),
>>>>  below as (select a, b from dt where b < .5)
>>>> select above.a, count(below.a) from above left outer join below on
>>>> (above.a = below.a) group by above.a;
>>>> a | count
>>>> ---+-------
>>>> 3 |     0
>>>> 2 |     1
>>>> (2 rows)
>>>>
>>>> How do I accomplish the same thing with data.tables?  This is what I
>>>> have so far:
>>>>
>>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
>>>> above = DT[DT$b > .5]
>>>> below = DT[DT$b < .5, list(a=a)]
>>>> key(below) = 'a'
>>>> below[above, list(count=length(a)), by=a]
>>>>
>>>> but this gives me:
>>>>
>>>>      a count
>>>> [1,]  2 1
>>>> [2,] NA 1
>>>>
>>>> Thanks in advance for any tips.
>>>>
>>>> --
>>>> Yang Zhang
>>>> http://yz.mit.edu/
>>>
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [hidden email]
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>>
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
>



--
Yang Zhang
http://yz.mit.edu/
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Basic join question

Matthew Dowle
On the previous post, the lack of example data makes it more time
consuming to help. Please provide a way to generate an example dt.

Answers to 2nd two questions inline below ...

On Wed, 2011-08-17 at 18:27 -0700, Yang Zhang wrote:

> The docs say:
>
> "When i is a data.table, x must have a key. i is joined to x using the
> key and the rows in x that match are returned. An equi-join is
> performed between each column in i to each column in x's key. The
> match is a binary search in compiled C in O(log n) time. If i has less
> columns than x's key then many rows of x may match to each row of i.
> If i has more columns than x's key, the columns of i not involved in
> the join are included in the result. If i also has a key, it is i's
> key columns that are used to match to x's key columns and a binary
> merge of the two tables is carried out."
>
> Some additional quick questions:
>
> 1. *Which* columns of i are used in the join (assuming no keys are
> set)? Is it just left-to-right?

Yes, when i has no key, it's left-to-right.

> 2. When there are two columns with the same name in x and i (which
> aren't being used as join keys), is just the one from x kept?

Yes, good question. There is a feature request to provide a "x." and
"i." prefix to cope better with this situation; e.g.,
    DT[X,age:=i.date-x.date,roll=TRUE]

>
>
> On Wed, Aug 17, 2011 at 6:26 PM, Yang Zhang <[hidden email]> wrote:
> > I'm going to continue here since the question is a bit more
> > complicated and SO isn't the best forum for back-and-forth.
> >
> > If I'm trying to do a join where I'm trying to aggregate counts
> > (including 0s for nomatches), is there something more concise than the
> > following, which is what I'm currently using since it works?
> >
> > # assume dt is a data.frame(user_id=..., age=...)
> > y = dt[, list(count=length(age)), by=user_id]
> > key(y) = 'user_id'
> > y = y[J(unique(x$user_id))]
> > y$count[is.na(y$count)] = 0
> >
> > I tried:
> >
> >> key(y) = 'user_id'
> >> y = y[J(unique(x$user_id)), list(count=length(age))]
> >> summary(y$count)
> >     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> >     1.00      1.00      1.00     75.55      5.00 127200.00
> >> dim(y)
> > [1] 7655    2
> >
> > which gives me the right number of output rows but none of the lengths
> > are 0, presumably because length(NA) == 1. (There are definitely users
> > in x that are not in y.)
> >
> > But then when I tried (and there are no NAs in y$age):
> >
> >> count = function(x) if (any(is.na(x))) integer(0) else length(x)
> >> key(y) = 'user_id'
> >> y = y[J(unique(x$user_id)), list(count=count(age))]
> >> summary(y$count)
> >    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
> >     1.0      2.0      6.0    160.4     21.0 127200.0
> >> dim(y)
> > [1] 3581    2
> >
> > Rows seem to be disappearing, and still the min is 1.
> >
> > At this point I'm pretty disoriented. Any explanation? Thanks in advance.
> >
> >
> > On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <[hidden email]> wrote:
> >> Thanks, edited the question.
> >>
> >> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <[hidden email]> wrote:
> >>> Yang,
> >>> Since you also asked on SO, suggest we answer there (after your edit please)
> >>> :
> >>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
> >>> Matthew
> >>>
> >>>
> >>> "Yang Zhang" <[hidden email]> wrote in message
> >>> news:[hidden email]...
> >>>> How do I do the equivalent to the following?
> >>>>
> >>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
> >>>> union select 2, 1 union select 3, 1 union select 3, 1),
> >>>>  above as (select a, b from dt where b > .5),
> >>>>  below as (select a, b from dt where b < .5)
> >>>> select above.a, count(below.a) from above left outer join below on
> >>>> (above.a = below.a) group by above.a;
> >>>> a | count
> >>>> ---+-------
> >>>> 3 |     0
> >>>> 2 |     1
> >>>> (2 rows)
> >>>>
> >>>> How do I accomplish the same thing with data.tables?  This is what I
> >>>> have so far:
> >>>>
> >>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
> >>>> above = DT[DT$b > .5]
> >>>> below = DT[DT$b < .5, list(a=a)]
> >>>> key(below) = 'a'
> >>>> below[above, list(count=length(a)), by=a]
> >>>>
> >>>> but this gives me:
> >>>>
> >>>>      a count
> >>>> [1,]  2 1
> >>>> [2,] NA 1
> >>>>
> >>>> Thanks in advance for any tips.
> >>>>
> >>>> --
> >>>> Yang Zhang
> >>>> http://yz.mit.edu/
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> datatable-help mailing list
> >>> [hidden email]
> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>>
> >>
> >>
> >>
> >> --
> >> Yang Zhang
> >> http://yz.mit.edu/
> >>
> >
> >
> >
> > --
> > Yang Zhang
> > http://yz.mit.edu/
> >
>
>
>


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Loading...