changing data.table by-without-by syntax to require a "by"

eddi
Matthew Dowle suggested I put this up for a discussion here.

This is a continuation of the discussion that started on SO and resulted in FR2696 (I recommend reading the latter first, as it is much clearer).

My case for the change boils down to the following: I believe d[i, j, by = b] should be always understood to mean

"take d, apply i, return j by b"

instead of the much more complicated current behavior, which is:

"take d, apply i, if i was not a merge, return j by b, if i was a merge, if no by, then return j by key, else if b and b == key, complain and return j by b, else return j by b"

I believe, while disruptive to some current users, this will make data.table much more user-friendly for any future users (one piece of evidence I would suggest for this, besides my plea, is that FAQs 1.13-1.14 (and part of 1.12) would become completely unnecessary).

This is regarding syntax only, and I do NOT propose any changes to underlying behavior, in particular the speed-up when you do a "by" by the key of the join should stay (and should be done iff by=key is present).
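
As a rough illustration of the two readings, the sketch below contrasts evaluating j over the whole join result with evaluating j once per row of i. It rests on assumptions: the tables X and Y are made up for illustration, and it is written against later data.table releases (1.9.4 and newer), where per-row grouping is requested explicitly with by = .EACHI. That spelling is not part of the syntax under discussion in this thread.

```r
library(data.table)

# Illustrative tables (not from the thread): X is keyed on the join column.
X <- data.table(a = rep(1:3, each = 2), b = 1:6, key = "a")
Y <- data.table(a = c(1L, 2L))

# Reading 1: join, then evaluate j once over the whole joined result.
whole <- X[Y][, sum(b)]

# Reading 2: evaluate j once per row of Y (what the thread calls by-without-by).
# In data.table >= 1.9.4 this is spelled with an explicit by = .EACHI.
per_row <- X[Y, list(total = sum(b)), by = .EACHI]

print(whole)
print(per_row)
```

Here whole is a single number, while per_row has one row per row of Y.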


Re: changing data.table by-without-by syntax to require a "by"

Michael Nelson
I think this proposed change is completely unnecessary.

That a function may behave differently is entirely consistent with the various S3/S4 method systems (although neither is used here).

I think that drop = TRUE, when implemented, will take care of dropping join columns.



In reply to eddi's message of 20/04/2013, 5:54 AM.

Re: changing data.table by-without-by syntax to require a "by"

eddi
I think you're missing the point, Michael. Just because it's possible to do it the way it's done now doesn't mean that's the best way, as I've tried to argue in the OP. I don't think you've addressed the issue of unnecessary complexity pointed out in the OP.

Re: changing data.table by-without-by syntax to require a "by"

Sadao Milberg
I'd agree with Eduard, although it's probably too late to change behavior now.  Maybe for data.table.2?  Eduard's proposal seems more closely aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if requested).

S.

In reply to eddi's message of Mon, 22 Apr 2013 08:17:59 -0700.

Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle
But then what would be analogous to CROSS APPLY in SQL?


Re: changing data.table by-without-by syntax to require a "by"

eddi
By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key).

Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or by-with-by, for that matter :) ) helps with, e.g., the first example there:

"We table table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id"




In reply to Matthew Dowle's message of Wed, Apr 24, 2013 at 4:01 PM.

Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle

 

That sentence on that linked webpage seems incorect English, since table is a noun not a verb.  Should "table" be "join" perhaps?

Anyway, by-without-by is often used with join inherited scope (JIS).  For example, translating their example :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2), top=c(3,4))
1> Y
   a top
1: 1   3
2: 2   4
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
1>
 
If there was no by-without-by (analogous to CROSS BY),  then how would that be done?
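
For readers on later data.table releases, here is a hedged sketch of the same query. It assumes data.table 1.9.4 or newer, where grouping by each row of i is requested with by = .EACHI rather than implied by a missing by. The integer columns and the name res are illustrative choices, not taken from the transcript above.

```r
library(data.table)

# Same shape as the X and Y in the transcript, with integer columns throughout.
X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L), top = c(3L, 4L))

# For each row of Y, keep the first i.top matching rows of X.
# i.top refers to the 'top' column of the i table (Y).
res <- X[Y, head(.SD, i.top), by = .EACHI]
print(res)
```

The result has three rows for a == 1 and four rows for a == 2, matching the output shown in the transcript.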
 

In reply to Eduard Antonyan's message of 24.04.2013 22:22.

Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle

 

I meant CROSS APPLY, not CROSS BY (typo), and "incorrect" with 2 r's.  I picked up on that because, out of the entire page, you seemed to quote a sentence which made no sense.  The rest of the article looks great.

 

In reply to Matthew Dowle's message of 24.04.2013 23:28.

Re: changing data.table by-without-by syntax to require a "by"

eddi
In reply to this post by Matthew Dowle
I assumed they meant create a table :)

That looks cool; what's i.top? I can get a result very similar to yours by writing:

X[Y][, head(.SD, top[1]), by = a]

and I probably would want the following to produce your result (this might depend a little on what exactly i.top is):

X[Y, head(.SD, i.top), by = a]



Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle

 

The i. prefix is just a robust way to reference join inherited columns: here, the 'top' column in the i table.  It's like table aliases in SQL.

What about this? :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
1> Y
   a top
1: 1   3
2: 2   4
3: 1   2
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
8: 1  1
9: 1  4
1>

 

In reply to Eduard Antonyan's message of 24.04.2013 23:43.

Re: changing data.table by-without-by syntax to require a "by"

eddi
That's an interesting example. I didn't realize current behavior would do that. I'm not at a PC anymore, but I'll definitely think about it and report back, as it's not immediately obvious to me.


In reply to Matthew Dowle's message of Wed, Apr 24, 2013 at 5:50 PM.

Re: changing data.table by-without-by syntax to require a "by"

eddi
That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. 
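
To make the lost information concrete, here is a small sketch comparing the two approaches when Y repeats a join value. It assumes data.table 1.9.4 or newer, where grouping by each row of i is written by = .EACHI; the variable names are illustrative only.

```r
library(data.table)

X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# One group per row of Y: the two rows with a == 1 stay separate,
# so each one applies its own value of top.
per_i_row <- X[Y, head(.SD, i.top), by = .EACHI]

# Join first, then group by a: both a == 1 rows of Y land in one group,
# so only the first value of top in that group can be applied.
joined_then_grouped <- X[Y][, head(.SD, top[1]), by = a]

print(nrow(per_i_row))
print(nrow(joined_then_grouped))
```

The first form returns 3 + 4 + 2 rows; the second returns only 3 + 4, because after X[Y] nothing records which row of Y each joined row came from.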

To retain that functionality and achieve better readability, as in the OP, I think something along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for the current syntax.


In reply to Eduard Antonyan's message of Apr 24, 2013, at 6:01 PM.

Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle

 

I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J or some other single symbol, what else instead? A character value such as by="irows" is currently taken to mean the "irows" column (for consistency with by="colA,colB,colC"). But you're suggesting that some signal needs to be passed to by= to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

x+y

Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer.
 
I'm happy to add the argument to [.data.table,  and make its default changeable via a global option in the usual way. 

Matthew
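
For readers less familiar with the distinction, here is a minimal base-R sketch of the two readings of X[Y, j] discussed above. It uses plain data.frames rather than data.table itself, the small X and Y tables used in this thread, and sum(b) as a stand-in for j. The names per_row, joined and whole are illustrative only, and what X[Y, j] actually returns depends on the data.table version in use.

```r
# Illustration only: base-R data.frames standing in for the thread's tables.
X <- data.frame(a = rep(1:3, 5), b = 1:15)
Y <- data.frame(a = c(1, 2), top = c(3, 4))

# Reading 1: evaluate j once for each row of Y (the by-without-by reading).
per_row <- sapply(Y$a, function(k) sum(X$b[X$a == k]))

# Reading 2: evaluate j once over everything Y joins to (the X[Y][, j] reading).
joined <- merge(X, Y, by = "a")
whole <- sum(joined$b)

print(per_row)  # 35 40
print(whole)    # 75
```

The two readings agree only after the per-row results are aggregated again, which is why the choice of default matters.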

 

On 25.04.2013 05:16, Eduard Antonyan wrote:

That's really interesting. I can't currently think of another way of doing that, as after X[Y] is done the necessary information is lost.
To retain that functionality and achieve better readability, as in the OP, I think something along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for the current syntax.

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <[hidden email]> wrote:

That's an interesting example. I didn't realize current behavior would do that. I'm not at a PC anymore, but I'll definitely think about it and report back, as it's not immediately obvious to me.


On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <[hidden email]> wrote:

 

i. prefix is just a robust way to reference join inherited columns:   the 'top' column in the i table.   Like table aliases in SQL.

What about this? :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
a b
1: 1 1
2: 1 4
3: 1 7
4: 1 10
5: 1 13
6: 2 2
7: 2 5
8: 2 8
9: 2 11
10: 2 14
11: 3 3
12: 3 6
13: 3 9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))

1> Y
a top
1: 1 3
2: 2 4
3: 1 2
1> X[Y, head(.SD,i.top)]
a b
1: 1 1
2: 1 4
3: 1 7
4: 2 2
5: 2 5
6: 2 8
7: 2 11
8: 1 1
9: 1 4
1>
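
To spell out what the call above computes, here is a rough base-R sketch of the same "for each row of Y, take the first top rows of X with a matching a" operation. It uses plain data.frames and an explicit loop, so it does not depend on how any particular data.table version treats X[Y, j]. The names per_row and result are illustrative only.

```r
# Illustration only: base-R equivalent of the per-row ("cross apply") result.
X <- data.frame(a = rep(1:3, 5), b = 1:15)
X <- X[order(X$a), ]                       # mimic key = "a"
Y <- data.frame(a = c(1, 2, 1), top = c(3, 4, 2))

# One piece per row of Y: the matching rows of X, truncated to `top` rows.
per_row <- lapply(seq_len(nrow(Y)), function(r) {
  matched <- X[X$a == Y$a[r], ]
  head(matched, Y$top[r])
})

result <- do.call(rbind, per_row)
rownames(result) <- NULL
print(result)   # b column: 1 4 7 2 5 8 11 1 4
```

Note that the information about which row of Y each piece came from exists only inside the loop, which is the point made in the thread about X[Y] losing it.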

 

On 24.04.2013 23:43, Eduard Antonyan wrote:

I assumed they meant create a table :)
That looks cool. What's i.top? I can get a result very similar to yours by writing:
X[Y][, head(.SD, top[1]), by = a]
and I probably would want the following to produce your result (this might depend a little on what exactly i.top is):
X[Y, head(.SD, i.top), by = a]


On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <[hidden email]> wrote:

 

That sentence on that linked webpage seems to be incorrect English, since table is a noun, not a verb. Should "table" be "join" perhaps?

Anyway, by-without-by is often used with join inherited scope (JIS).  For example, translating their example :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
a b
1: 1 1
2: 1 4
3: 1 7
4: 1 10
5: 1 13
6: 2 2
7: 2 5
8: 2 8
9: 2 11
10: 2 14
11: 3 3
12: 3 6
13: 3 9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2), top=c(3,4))
1> Y
a top
1: 1 3
2: 2 4
1> X[Y, head(.SD,i.top)]
a b
1: 1 1
2: 1 4
3: 1 7
4: 2 2
5: 2 5
6: 2 8
7: 2 11
1>
 
If there was no by-without-by (analogous to CROSS APPLY), then how would that be done?
 


Re: changing data.table by-without-by syntax to require a "by"

eddi
Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J.

The problem with using 'missingness' is that it already means something very different when i is not a join/cross: it means *don't* do a by. That introduces the whole case analysis from the OP that one has to go through in their head every time (which of course becomes automatic after a while, but it's a cost nonetheless, and a particularly high one for new people). So I see absence of 'by' as an already taken and used signal, and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optionally turning off this behavior unsatisfactory, and I don't see that as a solution to this at all.

I think in the x+y context the appropriate analog is: what if that added x and y normally, but when x and y were data.frames it did element-by-element multiplication instead? Yes, that's possible to do, and possible to document, but it's not a good idea, because it takes the place of adding them element by element. The recycling behavior doesn't do that. What it says is that it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors.
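
As a reminder of what recycling means here, this is a tiny base-R illustration (standard R behaviour, nothing specific to data.table): the shorter vector is repeated to the length of the longer one, and then the usual elementwise addition happens.

```r
# Recycling: y is repeated to match the length of x before adding.
x <- 1:6
y <- c(10, 20)

recycled <- x + y
explicit <- x + rep(y, length.out = length(x))

print(recycled)                      # 11 22 13 24 15 26
print(identical(recycled, explicit)) # TRUE
```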


Re: changing data.table by-without-by syntax to require a "by"

Sadao Milberg
Whatever the "right" way to do things is, the key issue is that default behavior should not be changed since existing code will rely on it.  So even though I tend to agree with Eduard, I would strongly advocate against any change in current behavior.  This aside, let me throw my 2 pennies in for the sake of data.table.2:

As for CROSS APPLY, to be honest, my experience with SQL has been primarily with MySQL < 5 so I didn't even know that existed.  As for your specific example a couple of e-mails ago, I believe this works:

X = data.table(a=1:3,b=1:15, key="a")
Y = data.table(a=c(1,2,1), top=c(3,4,2))
X[Y][, head(.SD, top[1]), by=list(a, top)]

Granted, this is somewhat inefficient since we now have the `top` vector replicated for each value of `a` in `X`.  You can probably come up with other examples that are inefficient or just don't work (e.g. `Y = data.table(a=c(1,2,1, 1), top=c(3,4,2,2))`), but the point here isn't whether you should allow CROSS APPLY or not, but what the "correct" syntax for invoking cross apply is.
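
A small base-R sketch of why the duplicated-row case defeats the by=list(a, top) workaround: grouping on (a, top) collapses identical rows of Y into a single group, whereas a per-row apply would handle each row separately. Y2 and the count variables are illustrative names, and the row counts assume the X from this thread, where every value of a has 5 rows (so head() is never short of rows).

```r
# Illustration only: Y with a duplicated (a, top) row.
Y2 <- data.frame(a = c(1, 2, 1, 1), top = c(3, 4, 2, 2))

n_i_rows <- nrow(Y2)           # rows a per-row apply would visit
n_groups <- nrow(unique(Y2))   # groups seen by by = list(a, top)

# Result rows if each row of Y2 contributes `top` rows, versus one
# contribution per distinct (a, top) group.
rows_per_i_row <- sum(Y2$top)
rows_grouped   <- sum(unique(Y2)$top)

print(c(n_i_rows, n_groups))             # 4 3
print(c(rows_per_i_row, rows_grouped))   # 11 9
```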

I would argue that the correct output to:

X[Y, sum(a * top)]

Should be 21, not:

   a V1
1: 1  3
2: 2  8
3: 1  2

While the output above may be convenient to you, it is not intuitive at all. In fact, it is an advanced caveat to standard behavior ("J is an expression evaluated in the context of X") that isn't straightforward to circumvent, and would likely bewilder most beginner users of data.table. I think given the parallels between data.table and SQL, "X[Y, sum(a * top)]" should mean "SELECT sum(X.a * Y.top) FROM X INNER JOIN Y USING(a)", not some more complex expression involving a CROSS APPLY. Note that if you want a CROSS APPLY in SQL, you have to ask for it (I guess I picked a terrible example here, since the GROUP is implied...).

I think the "correct" way to do the original task would be something along the lines of:

X[Y, head(.SD, i.top), cross.apply=TRUE]

or some such.

That said, data.table is yours.  It is a fantastic tool, and if you want to behave in a manner that simplifies your work rather than matches the intuitions of others, then it is your hard earned right that I fully respect.

Slightly off topic, why aren't the columns from the Y table available in join inherited scope when not doing a by-without-by? I find it odd that:

X[Y, sum(a * top), by=b]

Produces:
Error in `[.data.table`(X, Y, sum(a * top), by = b) : 
  object 'top' not found

Finally, is i.top documented?

S.



Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle
In reply to this post by eddi

 

I'd appreciate some input from others whether they agree or not.   If you have a view perhaps let me know off list,  or on list, whichever you prefer.

Thanks,

Matthew

 

On 25.04.2013 13:45, Eduard Antonyan wrote:

Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J.
The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. 
I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. 

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <[hidden email]> wrote:

 

I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J?  If not .J, or any single symbol what else instead?  A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC").  But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row.  Currently, that signal is missingness  (which I like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

x+y

Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer.
 
I'm happy to add the argument to [.data.table,  and make its default changeable via a global option in the usual way. 

Matthew

 

On 25.04.2013 05:16, Eduard Antonyan wrote:

That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. 
To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax.

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <[hidden email]> wrote:

that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me


On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <[hidden email]> wrote:

 

i. prefix is just a robust way to reference join inherited columns:   the 'top' column in the i table.   Like table aliases in SQL.

What about this? :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
a b
1: 1 1
2: 1 4
3: 1 7
4: 1 10
5: 1 13
6: 2 2
7: 2 5
8: 2 8
9: 2 11
10: 2 14
11: 3 3
12: 3 6
13: 3 9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))

1> Y
a top
1: 1 3
2: 2 4
3: 1 2
1> X[Y, head(.SD,i.top)]
a b
1: 1 1
2: 1 4
3: 1 7
4: 2 2
5: 2 5
6: 2 8
7: 2 11
8: 1 1
9: 1  4
1>

 

On 24.04.2013 23:43, Eduard Antonyan wrote:

I assumed they meant create a table :)
that looks cool, what's i.top ? I can get a very similar to yours result by writing:
X[Y][, head(.SD, top[1]), by = a]
and I probably would want the following to produce your result (this might depend a little on what exactly i.top is):
X[Y, head(.SD, i.top), by = a]


On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <[hidden email]> wrote:

 

That sentence on that linked webpage seems incorect English, since table is a noun not a verb.  Should "table" be "join" perhaps?

Anyway, by-without-by is often used with join inherited scope (JIS).  For example, translating their example :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
a b
1: 1 1
2: 1 4
3: 1 7
4: 1 10
5: 1 13
6: 2 2
7: 2 5
8: 2 8
9: 2 11
10: 2 14
11: 3 3
12: 3 6
13: 3 9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2), top=c(3,4))
1> Y
a top
1: 1 3
2: 2 4
1> X[Y, head(.SD,i.top)]
a b
1: 1 1
2: 1 4
3: 1 7
4: 2 2
5: 2 5
6: 2 8
7: 2 11
1>
 
If there was no by-without-by (analogous to CROSS BY),  then how would that be done?
 

On 24.04.2013 22:22, Eduard Antonyan wrote:

By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key).
Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there:
"We table table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id"




On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle <[hidden email]> wrote:
But then what would be analogous to CROSS APPLY in SQL?

> I'd agree with Eduard, although it's probably too late to change behavior
> now.  Maybe for data.table.2?  Eduard's proposal seems more closely
> aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if
> requested).
>
> S.
>
>> Date: Mon, 22 Apr 2013 08:17:59 -0700
>> From: [hidden email]
>> To: [hidden email]
>> Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"
>>
>> I think you're missing the point Michael. Just because it's possible to do it
>> the way it's done now, doesn't mean that's the best way, as I've tried to
>> argue in the OP. I don't think you've addressed the issue of unnecessary
>> complexity pointed out in OP.
>>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> [hidden email]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: changing data.table by-without-by syntax to require a "by"

Matthew Dowle

 

I didn't get any feedback off list on this one.

But I'm coming round to the idea.

What about by=.JOIN   (is that what you were thinking .J stood for?)  Other possibilities: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN.  Just to brainstorm it.

by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be.

I'm also coming round to changing the default for X[Y, j]. It might help in a few related areas, e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatible changes in the past by introducing a global option which we slowly migrate to. If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN, and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be changed to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...], i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility.
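To make the three-state option concrete, here is a rough base R sketch of how such a switch could be read. The option name datatable.bywithoutby is the one proposed above; the helper bywithoutby_policy() and its return values are purely hypothetical and are not part of data.table.

```r
# Hypothetical helper, not part of data.table: what should an implicit
# by-without-by call do under the proposed datatable.bywithoutby option?
bywithoutby_policy <- function() {
  opt <- getOption("datatable.bywithoutby", default = TRUE)
  if (isTRUE(opt)) {
    "group"                      # today's behaviour, silent
  } else if (identical(opt, "warning")) {
    warning("implicit by-without-by; consider an explicit by", call. = FALSE)
    "group"                      # still works, but flags the call site
  } else if (identical(opt, FALSE)) {
    "no_group"                   # explicit by required for grouping
  } else {
    stop("datatable.bywithoutby must be TRUE, \"warning\" or FALSE")
  }
}

# Migration step for an existing user: surface implicit uses without breaking them
options(datatable.bywithoutby = "warning")
withCallingHandlers(
  print(bywithoutby_policy()),
  warning = function(w) {
    message("flagged: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

With the default left at TRUE nothing changes for existing code, which is the point of the staged approach.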

Or ... instead of :

    X[Y, j, by=.JOIN]

what about :

    X[by=Y, j]

Matthew

 

On 25.04.2013 16:32, Matthew Dowle wrote:

 

I'd appreciate some input from others whether they agree or not.   If you have a view perhaps let me know off list,  or on list, whichever you prefer.

Thanks,

Matthew

 

On 25.04.2013 13:45, Eduard Antonyan wrote:

Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J.
The problem with using 'missingness' is that it already means something very different when i is not a join/cross: it means *don't* do a by. That introduces the whole case analysis one has to go through in their head every time, as in the OP (which of course becomes automatic after a while, but it's a cost nonetheless, and a particularly high one for new people). So I see absence of 'by' as an already taken and used signal, and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optionally turning off this behavior unsatisfactory, and I don't see that as a solution to this at all.
I think in the x+y context the appropriate analogy is: what if that added x and y normally, but when x and y were data.frames it did element-by-element multiplication instead? Yes, that's possible to do, and possible to document, but it's not a good idea, because it takes the place of adding them element by element. The recycling behavior doesn't do that. What it says is that it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors.

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <[hidden email]> wrote:

 

I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J?  If not .J, or any single symbol what else instead?  A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC").  But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row.  Currently, that signal is missingness  (which I like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

x+y

Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer.
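A small sketch of the two ideas in that paragraph: recycling in x+y, and j applied per row of Y versus j applied to the whole joined set. It assumes a data.table release where the per-row form is written by = .EACHI; the tables are made up for illustration.

```r
library(data.table)

# Recycling in base R: y is reused along x
x <- 1:6
y <- c(10, 20)
x + y   # 11 22 13 24 15 26

X <- data.table(a = 1:3, b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L))

# j evaluated once per row of Y (what by-without-by does implicitly)
per_row <- X[Y, sum(b), by = .EACHI]

# j evaluated once over everything Y joins to
whole <- X[Y][, sum(b)]

print(per_row)
print(whole)
```

per_row has one sum per row of Y, while whole is a single number for the joined set.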
 
I'm happy to add the argument to [.data.table,  and make its default changeable via a global option in the usual way. 

Matthew

 

On 25.04.2013 05:16, Eduard Antonyan wrote:

That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. 
To retain that functionality and achieve better readability, as in OP, I think something along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax.

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <[hidden email]> wrote:

that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me


On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <[hidden email]> wrote:

 

i. prefix is just a robust way to reference join inherited columns:   the 'top' column in the i table.   Like table aliases in SQL.
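As a rough illustration of why the prefix is "robust": when both tables have a column of the same name, the plain name in j refers to X's column and the i. form picks out the one from the i table. The tables and column names below are invented for the example, and by = .EACHI assumes a data.table release that provides it.

```r
library(data.table)

# Both tables have a column called v
X <- data.table(id = c(1L, 1L, 2L), v = c(10, 20, 30), key = "id")
Y <- data.table(id = c(1L, 2L), v = c(0.5, 2))

# Inside j, plain v is X's column; i.v is the v column of the i table (Y)
res <- X[Y, .(scaled = sum(v) * i.v), by = .EACHI]
print(res)
```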

What about this? :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
1> Y
   a top
1: 1   3
2: 2   4
3: 1   2
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
8: 1  1
9: 1  4
1>
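The difference between grouping per row of Y and grouping by column a after the join can be sketched as follows. This assumes a data.table release where the per-row form is spelled by = .EACHI; the integer literals in Y are my own choice to keep the key types aligned.

```r
library(data.table)

X <- data.table(a = 1:3, b = 1:15, key = "a")
# Key value 1 appears twice in Y, with a different 'top' each time
Y <- data.table(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# One group per row of Y: 3 + 4 + 2 = 9 rows, as in the transcript
each_i <- X[Y, head(.SD, i.top), by = .EACHI]

# Join first, then group by a: both Y rows with a == 1 fall into one group,
# so only top[1] is used for it and the second request is lost
after_join <- X[Y][, head(.SD, top[1]), by = a]

print(each_i)
print(after_join)
```

Once X[Y] has been materialised, the information about which row of Y produced which rows is no longer available to a plain by = a.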

 


Re: changing data.table by-without-by syntax to require a "by"

Sadao Milberg
Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by.  The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such.

This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?).  To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i`'s job to do that.

Loosely related, what does .JOIN represent?  Is it just a flag, or is it a derived variable the way .SD is?  If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data.
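To illustrate the point that the existing dot-symbols hold data rather than acting as flags, here is a short sketch using symbols that data.table already provides (.N, .GRP, .I, .BY); the table is the X from earlier in the thread.

```r
library(data.table)

X <- data.table(a = 1:3, b = 1:15, key = "a")

# Each special symbol is a real value inside j:
#   .N   number of rows in the current group
#   .GRP running group counter
#   .I   row numbers of the group's rows within X
#   .BY  list holding the current group's by values
res <- X[, .(n = .N, grp = .GRP, first_row = .I[1], key_val = .BY$a), by = a]
print(res)
```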

Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed)", do you mean join inherited scope will work even when we're not in by-without-by mode?  That would be great.

S.



Date: Fri, 26 Apr 2013 12:14:02 +0100
From: [hidden email]
To: [hidden email]
CC: [hidden email]
Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

 

Re: changing data.table by-without-by syntax to require a "by"

eddi
I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior.

Sadao is making good points. The question of what by=list(a, .JOIN) does still applies with the cross.apply=TRUE syntax, though, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax: in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or something like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me.

If I were to rate all of the above from imo best to worst, it would be:
.JOIN (or .J - yes, I'm biased:) )
.EACHI/cross.apply=TRUE
.EACHIROW/.EACHJOIN
.CROSSAPPLY
X[by=Y,j]

After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well.

Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra).


On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg <[hidden email]> wrote:
Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by.  The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such.

This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?).  To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that.

Loosely related, what does .JOIN represent?  Is it just a flag, or is it a derived variable the way .SD is?  If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data.

Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed)", do you mean joint inherited scope will work even when we're not in by-without-by mode?  That would be great.

S.



Date: Fri, 26 Apr 2013 12:14:02 +0100
From: [hidden email]
To: [hidden email]
CC: [hidden email]

Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

 
I didn't get any feedback off list on this one.
But I'm coming round to the idea.
What about by=.JOIN   (is that you were thinking .J stood for?)  Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN.  Just to brainstorm it.
by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be.
To change the default for X[Y, j] I'm also coming round to.   It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed).  We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to.  If datatable.bywithoutby was added it could take values  TRUE|"warning"|FALSE from day one, with default TRUE.  That allows those who wish for explicit by to migrate straight away by changing the default to FALSE.  Existing users could set it to "warning" to see how many implicit bywithoutby they have.   Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time,   for say a year,   with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis.   Then the default could be chaged to "warning"  before finally FALSE.     Depending on how it goes,  the option could be left there to allow TRUE if anyone wanted it,  or removed (maybe after two years).   Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility.
Or ... instead of :
    X[Y, j, by=.JOIN]
what about :
    X[by=Y, j]
Matthew
 
On <a href="tel:25.04.2013%2016" value="+12504201316" target="_blank">25.04.2013 16:32, Matthew Dowle wrote:
 
I'd appreciate some input from others whether they agree or not.   If you have a view perhaps let me know off list,  or on list, whichever you prefer.
Thanks,
Matthew
 
On <a href="tel:25.04.2013%2013" value="+12504201313" target="_blank">25.04.2013 13:45, Eduard Antonyan wrote:
Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J.
The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. 
I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. 

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <[hidden email]> wrote:

 
I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J?  If not .J, or any single symbol what else instead?  A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC").  But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row.  Currently, that signal is missingness  (which I like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

x+y

Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer.
 
I'm happy to add the argument to [.data.table,  and make its default changeable via a global option in the usual way. 
Matthew
 
On <a href="tel:25.04.2013%2005" value="+12504201305" target="_blank">25.04.2013 05:16, Eduard Antonyan wrote:
That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. 
To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax.

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <[hidden email]> wrote:

that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me


On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <[hidden email]> wrote:
 
i. prefix is just a robust way to reference join inherited columns:   the 'top' column in the i table.   Like table aliases in SQL.
What about this? :
1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
1> Y
   a top
1: 1   3
2: 2   4
3: 1   2
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
8: 1  1
9: 1  4
1>
 
On 24.04.2013 23:43, Eduard Antonyan wrote:
I assumed they meant create a table :)
that looks cool, what's i.top? I can get a result very similar to yours by writing:
X[Y][, head(.SD, top[1]), by = a]
and I probably would want the following to produce your result (this might depend a little on what exactly i.top is):
X[Y, head(.SD, i.top), by = a]


On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <[hidden email]> wrote:
 
That sentence on that linked webpage seems to be incorrect English, since table is a noun not a verb.  Should "table" be "join" perhaps?
Anyway, by-without-by is often used with join inherited scope (JIS).  For example, translating their example :
1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2), top=c(3,4))
1> Y
   a top
1: 1   3
2: 2   4
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
1>
 
If there was no by-without-by (analogous to CROSS APPLY),  then how would that be done?
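
To make concrete what the X[Y, head(.SD,i.top)] call computes, here is a rough base R sketch using plain data frames rather than data.table. The helper name cross_apply_head is made up for illustration; it only mimics the "for each row of Y, take the first `top` matching rows of X" idea and says nothing about how data.table implements it.

```r
# Base R sketch (data frames, not data.table). X mirrors the thread's example:
# a = 1:3 recycled over b = 1:15, then sorted by a to mimic key = "a".
X <- data.frame(a = rep(1:3, times = 5), b = 1:15)
X <- X[order(X$a, X$b), ]
Y <- data.frame(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# Hypothetical helper: for each row r of Y, keep the first Y$top[r] rows of X
# whose `a` equals Y$a[r], then stack the pieces in Y's row order.
cross_apply_head <- function(X, Y) {
  pieces <- lapply(seq_len(nrow(Y)), function(r) {
    matched <- X[X$a == Y$a[r], , drop = FALSE]
    head(matched, Y$top[r])
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- cross_apply_head(X, Y)
print(res)
```

Note that the repeated a = 1 row in Y yields two separate pieces, which is exactly the information that is no longer available once X[Y] has been materialised and grouped by a.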
 
On 24.04.2013 22:22, Eduard Antonyan wrote:
By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key).
Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there:
"We table table1 and table2. table1 has a column called rowcount.

For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id"




On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle <[hidden email]> wrote:
But then what would be analogous to CROSS APPLY in SQL?

> I'd agree with Eduard, although it's probably too late to change behavior
> now.  Maybe for data.table.2?  Eduard's proposal seems more closely
> aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if
> requested).
>
> S.
>
>> Date: Mon, 22 Apr 2013 08:17:59 -0700
>> From: [hidden email]
>> To: [hidden email]
>> Subject: Re: [datatable-help] changing data.table by-without-by
>> syntax       to      require a "by"
>>
>> I think you're missing the point Michael. Just because it's possible to
>> do it
>> the way it's done now, doesn't mean that's the best way, as I've tried
>> to
>> argue in the OP. I don't think you've addressed the issue of unnecessary
>> complexity pointed out in OP.
>>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> [hidden email]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: changing data.table by-without-by syntax to require a "by"

Sadao Milberg
each.i = TRUE sounds fine to me.


Date: Fri, 26 Apr 2013 10:17:28 -0500
Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"
From: [hidden email]
To: [hidden email]
CC: [hidden email]; [hidden email]

I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior.

Sadao is making good points. The question of what does by=list(a, .JOIN) do can still apply though with cross.apply=TRUE syntax, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax - in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or smth like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me.

If I were to rate all of the above from imo best to worst, it would be:
.JOIN (or .J - yes, I'm biased:) )
.EACHI/cross.apply=TRUE
.EACHIROW/.EACHJOIN
.CROSSAPPLY
X[by=Y,j]

After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well.

Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra).
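
For readers on a later release: data.table 1.9.4 and later spell this idea as by=.EACHI, and a join without by no longer groups implicitly. The sketch below assumes such a version is installed; it uses integer columns in Y to sidestep any type coercion in the join and is meant as an illustration, not as the definitive form of either proposal in this thread.

```r
library(data.table)

X <- data.table(a = rep(1:3, times = 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# Explicit "each i row" grouping: j is evaluated once per row of Y,
# and i.top refers to the `top` value of that row.
res <- X[Y, head(.SD, i.top), by = .EACHI]
print(res)

# Plain join followed by an ordinary j over the whole joined result.
n_all <- X[Y][, .N]
print(n_all)
```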


On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg <[hidden email]> wrote:
Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by.  The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such.

This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?).  To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that.

Loosely related, what does .JOIN represent?  Is it just a flag, or is it a derived variable the way .SD is?  If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data.

Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed)", do you mean join inherited scope will work even when we're not in by-without-by mode?  That would be great.

S.



Date: Fri, 26 Apr 2013 12:14:02 +0100
From: [hidden email]
To: [hidden email]
CC: [hidden email]

Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

 
I didn't get any feedback off list on this one.
But I'm coming round to the idea.
What about by=.JOIN   (is that what you were thinking .J stood for?)  Other possibilities: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN.  Just to brainstorm it.
by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be.
To change the default for X[Y, j] I'm also coming round to.   It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed).  We have successfully made non-backwards-compatible changes in the past by introducing a global option which we slowly migrate to.  If datatable.bywithoutby was added it could take values  TRUE|"warning"|FALSE from day one, with default TRUE.  That allows those who wish for explicit by to migrate straight away by changing the default to FALSE.  Existing users could set it to "warning" to see how many implicit bywithoutby they have.   Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time,   for say a year,   with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis.   Then the default could be changed to "warning"  before finally FALSE.     Depending on how it goes,  the option could be left there to allow TRUE if anyone wanted it,  or removed (maybe after two years).   Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility.
Or ... instead of :
    X[Y, j, by=.JOIN]
what about :
    X[by=Y, j]
Matthew
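
The three-state option described above could be consulted along these lines. This is only a sketch in base R: the option name datatable.bywithoutby is the one floated in this message, while the function use_implicit_each_i and its two arguments are hypothetical and do not correspond to anything in the data.table source.

```r
# Hypothetical helper: decide whether an X[Y, j] call should group
# implicitly by each row of i, given the proposed global option.
#   TRUE      -> keep implicit by-without-by (proposed initial default)
#   "warning" -> keep it, but warn so the call can be migrated
#   FALSE     -> no implicit grouping; an explicit by is required
use_implicit_each_i <- function(by_missing, i_is_join) {
  if (!(by_missing && i_is_join)) {
    return(FALSE)  # the option only matters for a join with no by
  }
  mode <- getOption("datatable.bywithoutby", TRUE)
  if (identical(mode, "warning")) {
    warning("implicit by-without-by used; consider an explicit by",
            call. = FALSE)
    return(TRUE)
  }
  isTRUE(mode)
}

# A user who wants explicit by straight away would flip the option:
old <- options(datatable.bywithoutby = FALSE)
print(use_implicit_each_i(by_missing = TRUE, i_is_join = TRUE))
options(old)
```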
 
On 25.04.2013 16:32, Matthew Dowle wrote:
 
I'd appreciate some input from others whether they agree or not.   If you have a view perhaps let me know off list,  or on list, whichever you prefer.
Thanks,
Matthew
 
On 25.04.2013 13:45, Eduard Antonyan wrote:
Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J.
The problem with using 'missingness' is that it already means smth very different when i is not a join/cross: it means *don't* do a by, thus introducing the whole case analysis one has to go through in their head every time, as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, and a particularly high one for new people). So I see absence of 'by' as an already taken and used signal, and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. 
I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. 
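
The x+y point can be checked directly in base R. In this small sketch the shorter vector is recycled and the operation is still addition, and adding two data frames is likewise elementwise addition rather than some other operation. The variable names are arbitrary.

```r
x <- 1:6
y <- c(10, 20)

# y is recycled to 10 20 10 20 10 20 before the elementwise addition.
z <- x + y
print(z)

# Adding two data frames of the same shape is also elementwise addition.
d <- data.frame(p = 1:2) + data.frame(p = 3:4)
print(d)
```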

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <[hidden email]> wrote:

 
I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J?  If not .J, or any single symbol what else instead?  A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC").  But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row.  Currently, that signal is missingness  (which I like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

x+y

Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer.
 
I'm happy to add the argument to [.data.table,  and make its default changeable via a global option in the usual way. 
Matthew
 