Follow-up on subsetting data.table with NAs

Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
Matthew,

Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 I have a few questions and thoughts, and I thought it would be more appropriate to share them here (even though I've already written 3 comments!).

1) First, you write that DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,]
However, that long expression can also be written as: DF[which(DF$ColA == DF$ColB), ]
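
To make point 1 concrete, here is a minimal base-R sketch on a made-up four-row data frame (the values are an assumption for illustration; only the column names come from the post). It shows that plain logical indexing returns an all-NA row for every NA in the index, while which() and the long !is.na() form return only the rows where the comparison is TRUE.

```r
# Toy data; values are invented for illustration.
DF <- data.frame(ColA = c(1, 2, NA, 4), ColB = c(1, 3, NA, NA))

keep <- DF$ColA == DF$ColB        # TRUE FALSE NA NA

by_logical <- DF[keep, ]          # NA in the index produces all-NA rows
by_which   <- DF[which(keep), ]   # which() keeps only the TRUE positions
by_long    <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]

nrow(by_logical)                  # 3: one real row plus two all-NA rows
nrow(by_which)                    # 1
nrow(by_long)                     # 1
```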

2) Second, you mention that the motivation is not just convenience but speed. Timing the three forms:

require(data.table)
set.seed(45)
df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2))
dt <- data.table(df)
system.time(dt[V1 == V2])
# 0.077 seconds
system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ])
# 0.252 seconds
system.time(df[which(df$V1 == df$V2), ])
# 0.038 seconds

We see that the `which` version on the data.frame (which also drops the NAs) is faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is also faster than `DT[V1 == V2]`. I suspect this is because of the snippet below in `[.data.table`:

        if (is.logical(i)) {
            if (identical(i,NA)) i = NA_integer_  # see DT[NA] thread re recycling of NA logical
            else i[is.na(i)] = FALSE              # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        }

But further down, `irows = which(i)` is computed anyway:

            if (is.logical(i)) {
                if (length(i)==nrow(x)) irows=which(i)   # e.g. DT[colA>3,which=TRUE]

And this "irows" is what's used to index the corresponding rows. So, is replacing `NA` with FALSE really necessary? I may well have overlooked a purpose the replacement serves in other scenarios, but looking at this case alone, it doesn't seem necessary, since the row numbers are fetched with `which` later anyway.
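
The question can be checked in isolation with a small base-R sketch. It only mirrors the two quoted snippets on a made-up logical vector; other code paths in `[.data.table` that are not quoted here may still rely on the replacement, so this is an illustration rather than a proof that the replacement can go.

```r
# A logical index containing NAs (values invented for illustration).
i <- c(TRUE, NA, FALSE, TRUE, NA)

# The replacement step from the first quoted snippet.
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE

# which() skips NA on its own, so both versions give the same row numbers.
which(i)         # 1 4
which(i_clean)   # 1 4
same <- identical(which(i), which(i_clean))
```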

3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"), then are there other reasons to depart from R's default philosophy of treating NAs as unknown/missing observations? I find I can relate more to the native way of handling NAs. For example:

x <- c(1,2,3,NA)
x != 3
# TRUE TRUE FALSE NA

makes more sense, because if NA is a missing observation or unknown value, `NA != 3` is neither TRUE nor FALSE. The answer "unknown/missing" therefore seems more appropriate.

I'd be interested in hearing others' thoughts and inputs as well, in addition to Matthew's.

Best regards,

Arun


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: Follow-up on subsetting data.table with NAs

Matthew Dowle

On 09.06.2013 22:08, Arunkumar Srinivasan wrote:

> 1) First, you write that, DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,]
> However, you can write this long expression as: DF[which(DF$ColA == DF$ColB), ]

Good point. But DT[ColA == ColB] still seems simpler than DF[which(DF$ColA == DF$ColB), ] (in data.table, DT[which(ColA == ColB)]). I worry about forgetting that I need which() and then having bugs appear when NAs turn up in the data at some point in the future, even though they don't occur now or in tests.

> 2) Second, you mention that the motivation is not just convenience but speed. [...]
> And this "irows" is what's used to index the corresponding rows. So, is the replacement of `NA` to FALSE really necessary? I may very well have overlooked the purpose of the NA replacement to FALSE for other scenarios, but just by looking at this case, it doesn't seem like it's necessary as you fetch index/row numbers later.

Interesting. Cool, so dt[V1 == V2] can and should be at least as fast as the which() way. Will file a FR to improve that speed!

> 3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"),

Not sure that is agreed yet, but happy to be persuaded.

> then are there other reasons to change the default behaviour of R's philosophy of handling NAs as unknowns/missing observations? I find I can relate more to the native concept of handling NAs. For example:
> x <- c(1,2,3,NA)
> x != 3
> # TRUE TRUE FALSE NA
> makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE, if NA is a missing observation/unknown data. The answer "unknown/missing" seems more appropriate, therefore.

True, but the context in which that result is used is all-important; in this case that's `i` of [.data.table or [.data.frame. It may be easier to consider == first. The data.table philosophy is that DT[x==3] should exclude any rows where x is NA, without needing to do anything special such as knowing to call which() as well. That differs from data.frame, but is more consistent with SQL. In SQL, "where x = 3" doesn't need anything else if x contains some NULL values.

 

 


Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
Matthew,

I personally don't think using "which" takes away from the simplicity of the syntax. However, now that it's clear to me that data.table's philosophy is closer to SQL's, I don't see a reason to require "which".

Even in the context of `[` in data.table/data.frame, treating NA as "missing/unknown" data can be argued for from R's philosophy.

dt <- data.table(x=c(1,3,4,NA), y=c(1:4))
dt[x <= 3]

Here, one could argue that we don't know whether the missing value in the 4th row is <= 3 or not. So the question becomes what action to take: do you return the rows for which no decision could be made, or not? But as you rightly pointed out, the idea behind data.table is to be SQL-like, so the current output is perfectly reasonable, and retaining the NA rows would not fit that idea.
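
For comparison, a minimal base-R sketch of the two possible actions on the same data, using a data.frame instead of a data.table so that it runs without the package. The variable names are hypothetical. Plain logical indexing returns the undecidable row as an all-NA row, while which() and subset() drop it, which is the SQL-like outcome discussed above.

```r
df <- data.frame(x = c(1, 3, 4, NA), y = 1:4)

cond <- df$x <= 3                  # TRUE TRUE FALSE NA

r_style   <- df[cond, ]            # rows 1 and 2, plus an all-NA row
sql_style <- df[which(cond), ]     # rows 1 and 2 only
sub_style <- subset(df, x <= 3)    # subset() also drops the NA row

nrow(r_style)     # 3
nrow(sql_style)   # 2
nrow(sub_style)   # 2
```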

Regarding FR4652, thanks for the speedy filing of this! I'm glad to have spotted it.

Best regards,
Arun.

Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
Matthew,

Regarding the changes you suggested in response to Frank's post here: http://stackoverflow.com/a/17008872/559784 I find them a bit more confusing and, frankly, not like SQL.

You wrote: "If I haven't understood correctly feel free to correct, otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where both colA and colB are NA (which it doesn't currently, I believe)."

Even though SQL (e.g. via sqldf) handles NAs differently from data.frame, it doesn't seem to treat NA == NA as a match. That is,

df <- data.frame(x = c(1:3,NA), y = c(NA,4:5,NA))
require(sqldf)

sqldf("select * from df where x == y")
# returns empty data.frame

sqldf("select * from df where x != y")
  x y
1 2 4
2 3 5

That is, at least in the sqldf package, neither NA == NA nor NA != NA is matched, which is very much in line with R's default, where NA == NA and NA != NA both give NA. But I don't think the result is considered FALSE here. It just acts like the "subset" function, where all entries that evaluate to NA are simply dropped. With the data.table philosophy, however, NA != NA should evaluate to TRUE, which I don't think (from my meagre understanding of SQL) is what SQL does. Please correct me if I've got it wrong.

I think it is clearer and simpler if "NAs are just dropped" after evaluating logical expressions. That would also be easy to document and easier to grasp, imho. It would also explain why the NA rows were removed in Frank's post.
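
A small base-R sketch of what "NAs are just dropped" means for the compound expression quoted above. The column values are invented for illustration. R's three-valued logic leaves NA wherever the outcome is undecided, and which() then drops those positions, so the row with a non-NA colA and an NA colB is not returned under this rule, unlike in the quoted proposal.

```r
# Three-valued logic in R.
NA == NA      # NA
NA != NA      # NA
FALSE & NA    # FALSE: the result is decided regardless of the NA
TRUE & NA     # NA: the result depends on the unknown value

# Compound expression on made-up columns.
colA <- c("foo", "foo", NA,    "baz")
colB <- c("bar", NA,    "qux", NA)

cond <- colA == "foo" & colB != "bar"   # FALSE NA NA FALSE
rows <- which(cond)                     # no rows survive once NAs are dropped
```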

And if there is more consensus, perhaps an option such as "na.rm = TRUE/FALSE" could be added?
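
One way to picture such an option is the base-R sketch below. `subset_rows` is a hypothetical helper written only for this illustration; it is not part of data.table or base R, and the thread does not specify how the option would actually be implemented.

```r
# Hypothetical helper: logical row subsetting with an na.rm switch.
subset_rows <- function(df, i, na.rm = TRUE) {
  stopifnot(is.logical(i), length(i) == nrow(df))
  if (na.rm) {
    df[which(i), , drop = FALSE]   # drop rows where the condition is NA
  } else {
    df[i, , drop = FALSE]          # data.frame behaviour: NA gives an all-NA row
  }
}

df <- data.frame(x = c(1, 3, 4, NA), y = 1:4)

dropped <- subset_rows(df, df$x <= 3)                  # 2 rows
kept    <- subset_rows(df, df$x <= 3, na.rm = FALSE)   # 3 rows, last one all NA
```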

Arun



Re: Follow-up on subsetting data.table with NAs

Matthew Dowle

 

Hi Arun,

Hm, good point.  Is data.table consistent with SQL already, for both == and !=, and so no change needed?  And it was correct for Frank to be mistaken.  Maybe just some more documentation and examples needed then. Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :

http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

"na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before in the context of joins, not logical subsets.

 

Thanks, Matthew

 


Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan

> Hm, good point.  Is data.table consistent with SQL already, for both == and !=, and so no change needed?

Yes, I believe it's already consistent with SQL. However, the current interpretation (in the documentation) that NA is treated as FALSE is unnecessary and not quite accurate, imho (please see below).

> And it was correct for Frank to be mistaken.

Yes, it seems like he was mistaken.

> Maybe just some more documentation and examples needed then.

It'd be much more appropriate if the documentation described subsetting in data.table as mimicking the "subset" function (in order to be in line with SQL), that is, dropping the rows where the logical expression evaluates to NA. As the code I pasted a couple of posts back shows, replacing NAs with FALSE is not necessary, because `irows <- which(i)` makes clear that `which` is used to get the indices before subsetting. That fits perfectly well with this interpretation of NA in data.table.

> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer of simplicity in data.table. But I still think equivalent logical operations should not give different results. If !(x == .) and x != . are indeed meant to differ, then I'd suggest replacing `!` with a more appropriate name, as it's far too easy to get confused otherwise.

In essence, either !(x == .) must evaluate to the same thing as (x != .), if their underlying meaning is the same, or the `!` in `!(x==.)` must be replaced by something more appropriate for what it's supposed to do. Personally, I prefer the former. It would greatly tighten the structure and consistency.

> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before in the context of joins, not logical subsets.

Yes, I think this option would give more control when evaluating expressions in `i`, by providing both the "subset" behaviour (default) and the typical data.frame subsetting (na.rm = FALSE).

Best regards,
 
Arun



Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments here: http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 

Pasted here for convenience:
data.table is mimicing subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! to avoid this confusion -- this might be another discussion to have on the mailing list -- (I think it is a discussion worth having) 
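
The two readings of `!` can be compared in a short base-R sketch. The "complement" computation below is only an assumed model of the not-join prefix (all rows that the expression without `!` does not return); it is not data.table's actual code. Under ordinary logical negation the two spellings agree, while under the complement reading the NA row is returned as well.

```r
x <- c(1, 2, 3, NA)

# Ordinary logical negation: both spellings give TRUE TRUE FALSE NA.
negated   <- !(x == 3)
not_equal <- x != 3
agree <- identical(negated, not_equal)

# With NA results dropped, both select rows 1 and 2.
rows_negation <- which(negated)

# Complement reading (assumed model of the prefix): every row that
# x == 3 does not return, so the NA row 4 is included too.
rows_complement <- setdiff(seq_along(x), which(x == 3))
```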

Arun

Re: Follow-up on subsetting data.table with NAs

Matthew Dowle

 

Hi,

How about ~ instead of ! ? I ruled out - previously, to leave + and - available for future use. NJ() may be possible too.

Matthew

 

Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
In reply to this post by Arunkumar Srinivasan
(Sorry @Matthew for the double email, I forgot to include the list once again).

However, one inconsistency I find with the use of `!(x==.)` is this:

dt1 <- data.table(x = 0:4, y=5:9)
> dt1[!(x)]
   x y
1: 4 9

Not the correct result! If `!(x==.)` is equal to `x != .`, then the correct result should be the first row, shouldn't it?

dt2 <- data.table(x = c(0,3,4,NA), y = c(NA,4,5,NA))
> dt2[!(x)] # ends up in an error
Error in seq_len(nrow(x))[-irows] : 
  only 0's may be mixed with negative subscripts

It ends up in an error because `NA` is not removed/replaced.

Running the same on data.frame gives the results it's supposed to.
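
Both outcomes can be reproduced in base R by mimicking the `seq_len(nrow(x))[-irows]` expression that appears in the error message. This is a sketch under the assumption that the integer column is used directly as `irows`; the surrounding data.table code is not shown in the thread.

```r
# dt1: x = 0:4 is treated as row numbers to exclude.
irows <- 0:4
remaining <- seq_len(5)[-irows]   # 0 is ignored, rows 1 to 4 are removed
remaining                         # 5, i.e. the row with x = 4, y = 9

# dt2: an NA among the negated row numbers cannot be combined
# with negative subscripts, so the same expression fails.
irows_na <- c(0, 3, 4, NA)
res <- tryCatch(seq_len(4)[-irows_na], error = function(e) e)
inherits(res, "error")            # TRUE

# data.frame-style logical negation, for comparison: only x == 0 is selected.
which(!(0:4))                     # 1
```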

Arun



Re: Follow-up on subsetting data.table with NAs

Matthew Dowle
In reply to this post by Arunkumar Srinivasan

 

On 10.06.2013 09:53, Arunkumar Srinivasan wrote:

> However, one inconsistency I find with the use of `!(x==.)` is this:
> dt1 <- data.table(x = 0:4, y=5:9)
> > dt1[!(x)]
>    x y
> 1: 4 9
> Not the correct result! If `!(x==.)` is equal to `x != .`, then the correct result should be the first row, shouldn't it?

That result makes perfect sense to me. I don't think of !(x==.) as being the same as x!=. The ! is simply a prefix: the result is all the rows that wouldn't be returned if the ! prefix weren't there.

> dt2 <- data.table(x = c(0,3,4,NA), y = c(NA,4,5,NA))
> > dt2[!(x)] # ends up in an error
> Error in seq_len(nrow(x))[-irows] :
>   only 0's may be mixed with negative subscripts

That needs to be fixed. But we're getting quite theoretical here and far away from common use cases. Why would we ever have row numbers of the table as a column of the table itself, and want to select the rows whose numbers are not mentioned in that column?

> It ends up in an error because `NA` is not removed/replaced.
> Running the same on data.frame gives the results it's supposed to.


Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
In reply to this post by Matthew Dowle
Matthew,

How about ~ instead of ! ?      I ruled out - previously to leave + and - available for future use.  NJ() may be possible too.

Both "NJ()" and "~" are okay for me.

That result makes perfect sense to me.   I don't think of !(x==.) being the same as  x!=.    ! is simply a prefix.    It's all the rows that aren't returned if the ! prefix wasn't there.

I understand that `DT[!(x)]` does what `data.table` is designed to do currently. What I failed to mention was that if one were to consider implementing `!(x==.)` as the same as `x != .` then this behaviour has to be changed. Let's forget this point for a moment.

That needs to be fixed.  But we're getting quite theoretical here and far away from common use cases.  Why would we ever have row numbers of the table, as a column of the table itself and want to select the rows by number not mentioned in that column?
Probably I did not choose a good example. Suppose that I've a data.table and I want to get all rows where "x == 0". Let's say:

set.seed(45)
DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(15)) 

DF <- as.data.frame(DT)

To get all rows where x == 0, one could use DT[x == 0]. But, at least in the context of data.frames, it makes equal sense to write either of:

DF[!(DF$x), ] (or) DF[DF$x == 0, ]

All I want to say is that I'd expect `DT[!(x)]` to give the same result as `DT[x == 0]` (even though I fully understand that's not the intended behaviour of data.table), as it's more intuitive and less confusing.
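
For what it's worth, here is a minimal base-R sketch of the data.frame equivalence described above. The toy values are made up for illustration; they are not the output of the `set.seed(45)` example.

```r
# Toy data with explicit values (illustrative only, not from the thread).
DF <- data.frame(x = c(0, 5, 10, 0, 15), y = 1:5)

# `!` coerces numeric x to logical first: 0 -> FALSE, non-zero -> TRUE,
# so !DF$x is TRUE exactly where x == 0.
by_not <- DF[!(DF$x), ]
by_eq  <- DF[DF$x == 0, ]

print(by_not)

# Both forms select the same rows of a data.frame.
stopifnot(identical(by_not, by_eq))
```

This only shows the base-R behaviour; whether `DT[!(x)]` should match it is exactly the point under discussion.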

So, changing `!` to `~` or `NJ` is one half of the issue for me. The other is to restore the usual function of `!` in all contexts. I hope I've conveyed what I wanted to say better this time.

Best,

Arun


On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:

 

Hi,

How about ~ instead of ! ?      I ruled out - previously to leave + and - available for future use.  NJ() may be possible too.

Matthew

 

On 10.06.2013 09:35, Arunkumar Srinivasan wrote:

Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments here: http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 
Pasted here for convenience:
data.table is mimicking subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! to avoid this confusion -- this might be another discussion to have on the mailing list -- (I think it is a discussion worth having)

Arun

On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:

> Hm, good point.  Is data.table consistent with SQL already, for both == and !=, and so no change needed?

Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed / untrue, imho (Please see below).
 

> And it was correct for Frank to be mistaken.

Yes, it seems like he was mistaken.

> Maybe just some more documentation and examples needed then.

It'd be much more appropriate if the documentation said that subsetting in data.table mimics the "subset" function (in order to be in line with SQL) by dropping rows where the logical expression evaluates to NA. As I noted a couple of posts back, the code that replaces NAs with FALSE isn't necessary: `irows <- which(i)` makes clear that `which` is used to get the indices before subsetting, and `which` already drops NA. This fits perfectly well with the interpretation of NA in data.table.
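
A small base-R sketch of the point about `which`, assuming a plain logical vector stands in for the evaluated `i`. It says nothing about other code paths inside `[.data.table` that might still rely on the NA replacement.

```r
# A logical index with NA, as produced by comparing columns that contain NA.
i <- c(TRUE, NA, FALSE, TRUE, NA)

# which() returns only the positions that are TRUE; NA is dropped.
rows_direct <- which(i)

# Replacing NA with FALSE first gives the same positions.
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE
rows_clean <- which(i_clean)

print(rows_direct)
stopifnot(identical(rows_direct, rows_clean))
```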

> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
>
> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think equivalent logical operations should not give different results. If !(x== .) and x != . are indeed different, then I'd suggest replacing `!` with a more appropriate name, as it's much too easy to get confused otherwise.
In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of the two is the same, or the `!` in `!(x==.)` must be replaced with something more appropriate for what it's supposed to do. Personally, I prefer the former. It would greatly tighten the structure and consistency.
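
As a point of reference, in base R the two forms are indistinguishable, including where x is NA. A minimal sketch with made-up values:

```r
x <- c(0, 5, NA, 10)

a <- !(x == 5)   # negate the equality test
b <- x != 5      # inequality test

print(a)

# Both give NA where x is NA, and agree everywhere else.
stopifnot(identical(a, b))

# which() then drops the NA position in either case.
stopifnot(identical(which(a), which(b)))
```

Any difference seen in `DT[...]` therefore comes from how the leading `!` is interpreted by data.table, not from R's logical operators.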

> "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before in the context of joins, not logical subsets.

Yes, I find this option would give more control in evaluating expressions with ease in `i`, by providing both "subset" (default) and the typical data.frame subsetting (na.rm = FALSE).
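
To make the proposal concrete, here is a rough base-R sketch of what such a switch could look like. `subset_rows` and its `na.rm` argument are hypothetical names used only for illustration; nothing like this is claimed to exist in data.table.

```r
# Hypothetical helper (not part of data.table): choose how NA in a logical
# row index is handled when subsetting.
subset_rows <- function(df, i, na.rm = TRUE) {
  stopifnot(is.logical(i), length(i) == nrow(df))
  if (na.rm) {
    df[which(i), , drop = FALSE]   # like subset(): NA rows are dropped
  } else {
    df[i, , drop = FALSE]          # like DF[i, ]: NA yields an all-NA row
  }
}

DF <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

dropped <- subset_rows(DF, DF$x == 1)                 # NA row dropped
kept    <- subset_rows(DF, DF$x == 1, na.rm = FALSE)  # NA row retained

print(dropped)
print(kept)
```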
Best regards,
 
Arun

 

 



Re: Follow-up on subsetting data.table with NAs

Frank Erickson
+1 to using ~ for the not-join/join on complement/complement then join. Having some logical-looking i's lead to subsetting and others to not-joins can (for me) lead to mistakes that I'm not likely to catch until much later, if at all.

I'm not sure I follow Arun's second example. If the syntax is changed so that ~ works as ! does now, then presumably !x will be reverted to having only a logical interpretation -- coercing x to logical and taking the subset where x == 0 -- which is the behavior you want. So why is it a separate issue? The remaining difference from data.frames would be that DF[!x] would show NA rows, if any, while DT[!x] would not.
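
A short base-R sketch of the remaining difference described above. Here `subset()` stands in for the NA-dropping behaviour that `DT[!x]` would have if `!` reverted to a plain logical not; that stand-in is an assumption, since the change is only being proposed in this thread.

```r
DF <- data.frame(x = c(0, 5, NA, 0), y = 1:4)

# Plain data.frame indexing: the NA in !DF$x produces an all-NA row.
plain <- DF[!DF$x, ]

# subset() drops rows where the condition is NA.
dropped <- subset(DF, !x)

print(plain)
print(dropped)
```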

--Frank



Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
Frank,
You're right about my final point. I can't recollect why I wrote that now. I guess the `!` function will be restored automatically.

With my second example, all I wanted to establish was that there was another reason to change `!` from performing the action of a "Not Join" because `DT[!x]` is a perfectly valid syntax (for those who have worked with data.frames and have shifted to data.table) which will not perform the intended action as it'll be a Not Join. In addition, `DT[!x]` gives an error when "x" column has NA. This was meant to be an additional argument for not having `!` for Not Join. But this has caused more confusion. Let's forget about my examples :).

To conclude, "~" or "NJ" makes more sense than `!` for "Not join", and of course the function of `!` will then be automatically restored to "not" (preferably also with na.rm = TRUE/FALSE). This is what I intended to say in the original discussion. Sorry for any confusion.

Arun


Re: Follow-up on subsetting data.table with NAs

Gabor Grothendieck
In reply to this post by Arunkumar Srinivasan
The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for.  For
example, perhaps ~ x could mean get(x).  One aspect of data.table that
tends to be difficult is when you don't know the variable name ahead
of time and this would give a way to specify it concisely.
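
For context, a minimal base-R sketch of the get(x) pattern being referred to, using a plain data.frame and `with()`. The column names and values are invented for illustration.

```r
DF <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

col <- "b"   # column name known only at run time

# get() looks the name up among the columns when evaluated inside with().
vals <- with(DF, get(col))

# The same lookup used in a row filter.
rows <- DF[with(DF, get(col) > 15), ]

print(vals)
print(rows)
```

A prefix such as `~x` would only be shorthand for the `get(col)` call; that shorthand is the suggestion here, not something that exists.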




--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Re: Follow-up on subsetting data.table with NAs

Matthew Dowle

Hm, another good point.   We need ~ for formulae,  although I can't
imagine a formula in i (only in j).  But in both i and j we might want
to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression
syntax,  but ^ doesn't parse with a RHS only. Needs to be parsable as a
prefix.
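
The parsing constraint can be checked directly with `parse()`. A small sketch; the helper name `parses` is made up for this example.

```r
# Returns TRUE if `code` is syntactically valid R, FALSE otherwise.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

candidates <- c("X[!Y]", "X[-Y]", "X[+Y]", "X[~Y]", "X[^Y]")
result <- vapply(candidates, parses, logical(1))

# !, -, + and ~ are accepted as unary prefixes; ^ is binary only.
print(result)
```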

Maybe - then?  That would be consistent with the meaning of - in R.  I
don't think I actually had a specific use in mind to reserve - and + for,
but at the time it just seemed a shame to use up one of -/+ without
defining the other.  If - does a not join, then might + be more like
merge() (i.e. returning the union of the rows in x and i by join)?  I
think I had something like that in mind, but hadn't thought it through.
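
To pin down the two operations being contrasted, here is a base-R sketch on toy data.frames. It assumes "not join" means the rows of X whose key is absent from Y, and "union by join" means `merge(..., all = TRUE)`; the proposed `X[-Y]` and `X[+Y]` syntax itself is hypothetical.

```r
X <- data.frame(id = c(1, 2, 3), vx = c("a", "b", "c"))
Y <- data.frame(id = c(2, 3, 4), vy = c("p", "q", "r"))

# Not-join: rows of X whose key does not appear in Y.
not_join <- X[!(X$id %in% Y$id), ]

# Union of the rows of both tables, matched on the key.
full <- merge(X, Y, by = "id", all = TRUE)

print(not_join)
print(full)
```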

Some might say it should be a new argument e.g. notjoin=TRUE,  but my
thinking there is readability,  since we often have many lines in i, j
and by in that order, and if the "notjoin=TRUE" followed afterwards it
would be far away from the i argument to which it applies.  If we
incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet
more parameters, too.



Re: Follow-up on subsetting data.table with NAs

eddi
I don't have much to add, except to +1 the suggestion of restoring ! to mean a logical not instead of a not-join. Having !(x == 0) give different results from x != 0 or (!(x == 0)) is just too hard to understand, and requires some advanced understanding of what ! means and how it's parsed internally.
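
For readers wondering why the three spellings can differ at all, here is a sketch of how R parses them and how a function could detect a leading `!` on its unevaluated argument. `starts_with_not` is a made-up name; this is an illustration of the general mechanism, not data.table's actual code.

```r
# The outermost call differs between the three spellings.
head_of <- function(expr) as.character(expr[[1]])

h1 <- head_of(quote(!(x == 0)))     # "!"
h2 <- head_of(quote(x != 0))        # "!="
h3 <- head_of(quote((!(x == 0))))   # "("

# A function can inspect its unevaluated argument for a leading `!`.
starts_with_not <- function(i) {
  e <- substitute(i)
  is.call(e) && identical(e[[1]], as.name("!"))
}

r1 <- starts_with_not(!(x == 0))     # TRUE
r2 <- starts_with_not(x != 0)        # FALSE
r3 <- starts_with_not((!(x == 0)))   # FALSE: outermost call is `(`

print(c(h1, h2, h3))
print(c(r1, r2, r3))
```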


On Mon, Jun 10, 2013 at 9:35 AM, Matthew Dowle <[hidden email]> wrote:

Hm, another good point.   We need ~ for formulae,  although I can't imagine a formula in i (only in j).  But in both i and j we might want to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax,  but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.

- maybe then?  Consistent with - meaning in R.  I don't think I actually had a specific use in mind for - and +, to reserve them for,  but at the time it just seemed a shame to use up one of -/+ without defining the other.  If - does a not join, then, might + be more like merge() (i.e. returning the union of the rows in x and i by join).  I think I had something like that in mind, but hadn't thought it through.

Some might say it should be a new argument e.g. notjoin=TRUE,  but my thinking there is readability,  since we often have many lines in i, j and by in that order, and if the "notjoin=TRUE" followed afterwards it would be far away from the i argument to which it applies.  If we incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet more parameters, too.



On 10.06.2013 15:02, Gabor Grothendieck wrote:
The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for.  For
example, perhaps ~ x could mean get(x).  One aspect of data.table that
tends to be difficult is when you don't know the variable name ahead
of time and this would give a way to specify it concisely.
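For reference, this is what get(x) looks like today when the variable name is only known at run time. It is a base-R sketch on a plain data.frame (DF and col are made-up names), so it says nothing about how a ~ shorthand would be implemented in data.table.

```r
DF <- data.frame(x = c(0, 5, 10), y = c(3, 2, 1))

col <- "y"  # column name that is only known at run time

# with() evaluates get(col) inside DF, so the string is resolved to the column.
vals <- with(DF, get(col))
print(vals)
```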

On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
<[hidden email]> wrote:
Matthew,

How about ~ instead of ! ?      I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Both "NJ()" and "~" are okay for me.

That result makes perfect sense to me.   I don't think of !(x==.) being the
same as  x!=.    ! is simply a prefix.    It's all the rows that aren't
returned if the ! prefix wasn't there.

I understand that `DT[!(x)]` does what `data.table` is designed to do
currently. What I failed to mention was that if one were to consider
implementing `!(x==.)` as the same as `x != .` then this behaviour has to be
changed. Let's forget this point for a moment.

That needs to be fixed.  But we're getting quite theoretical here and far
away from common use cases.  Why would we ever have row numbers of the
table, as a column of the table itself and want to select the rows by number
not mentioned in that column?

Probably I did not choose a good example. Suppose that I've a data.table and
I want to get all rows where "x == 0". Let's say:

set.seed(45)
DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(15))

DF <- as.data.frame(DT)

To get all rows where x == 0, it could be done with DT[x == 0]. But it makes
sense, at least in the context of data.frames, to do equivalently,

DF[!(DF$x), ] (or) DF[DF$x == 0, ]

All I want to say is, I expect `DT[!(x)]` should give the same result as
`DT[x == 0]` (even though I fully understand it's not the intended behaviour
of data.table), as it's more intuitive and less confusing.
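The base-R equivalence relied on here can be checked without any data.table involved: applying ! to a numeric vector coerces it to logical first, so zeros become TRUE. The vector below is made up for illustration.

```r
x <- c(0, 5, 10, 15, 0, NA)

print(!x)      # TRUE where x is 0, NA where x is NA
print(x == 0)  # same logical vector

same <- identical(!x, x == 0)
print(same)
```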

So, changing `!` to `~` or `NJ` is one half of the issue for me. The other
is to replace the actual function of `!` in all contexts. I hope I came
across with what I wanted to say, better this time.

Best,

Arun


On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:



Hi,

How about ~ instead of ! ?      I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Matthew



On 10.06.2013 09:35, Arunkumar Srinivasan wrote:

Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments here:

http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
Pasted here for convenience:
data.table is mimicking subset in its handling of NA values in logical i
arguments. -- the only issue is the ! prefix signifying a not-join, not the
way one might expect. Perhaps the not join prefix could have been NJ not !
to avoid this confusion -- this might be another discussion to have on the
mailing list -- (I think it is a discussion worth having)

Arun

On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:

Hm, good point.  Is data.table consistent with SQL already, for both == and
!=, and so no change needed?

Yes, I believe it's already consistent with SQL. However, the current
interpretation of NA (documentation) being treated as FALSE is not needed /
untrue, imho (Please see below).


And it was correct for Frank to be mistaken.

Yes, it seems like he was mistaken.

Maybe just some more documentation and examples needed then.

It'd be much more appropriate if the documentation described subsetting in
data.table as mimicking the "subset" function (in order to be in line with
SQL) by dropping rows where the logical evaluates to NA. As I noted a couple
of posts back when I pasted the code, replacing NAs with FALSE is not
necessary: `irows <- which(i)` makes clear that `which` is used to get the
indices before subsetting, and this fits perfectly well with the
interpretation of NA in data.table.
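The point about which() can be seen in isolation with a small base-R sketch. It only demonstrates the behaviour of which() on a logical vector with NAs (the vector i is made up); it does not cover the other code paths in `[.data.table`.

```r
i <- c(TRUE, NA, FALSE, TRUE, NA)

# Variant A: call which() directly; NA entries are silently dropped.
rows_direct <- which(i)

# Variant B: first replace NA with FALSE, then call which().
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE
rows_clean <- which(i_clean)

print(rows_direct)
print(identical(rows_direct, rows_clean))  # TRUE: same row numbers either way
```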

Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :


http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

Ha, I like the idea behind the use of () in evaluating expressions. It's
another nice layer towards simplicity in data.table. But I still think that
equivalent logical operations should not give different results. If
!(x == .) and x != . are indeed different, then I'd suggest replacing `!`
with a more appropriate name, as it's much too easy to get confused
otherwise.
In essence, either !(x == .) must evaluate to (x != .) if the underlying
meaning of the two is the same, or the `!` in `!(x == .)` must be replaced
with something more appropriate for what it's supposed to do. Personally,
I prefer the former. It would greatly tighten the structure and consistency.

"na.rm = TRUE/FALSE" sounds good to me.  I'd only considered nomatch before
in the context of joins, not logical subsets.

Yes, I find this option would give more control in evaluating expressions
with ease in `i`, by providing both "subset" (default) and the typical
data.frame subsetting (na.rm = FALSE).
Best regards,

Arun








Re: Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan
In reply to this post by Matthew Dowle
Matthew, 

It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in:

dt <- data.table(x=c(0,0,1,1,3), y=1:5)
setkey(dt, "x")
dt[J(c(1,3))] # join
   x y
1: 1 3
2: 1 4
3: 3 5

dt[!J(c(1,3))]
   x y
1: 0 1
2: 0 2

Here the concept of "Not Join" with "!J(.)" makes total sense. However, extending it to a not-join for logical vectors is what seems to be the issue: that is logical indexing rather than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking whether `i` is a data.table or not), to tell subsetting by a logical vector apart from subsetting by a data.table and then decide what to do, would that resolve this issue? If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered a not-join?
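For comparison, the same two results can be written as plain logical indexing in base R, where ! is an ordinary logical not. This is only a rough data.frame analogue of the keyed join above (df and keys are made-up names), not a description of how data.table performs the join.

```r
df <- data.frame(x = c(0, 0, 1, 1, 3), y = 1:5)
keys <- c(1, 3)

# Analogue of dt[J(c(1,3))]: rows whose x matches one of the keys.
joined <- df[df$x %in% keys, ]

# Analogue of dt[!J(c(1,3))]: logical not of the same membership test.
not_joined <- df[!(df$x %in% keys), ]

print(joined)
print(not_joined)
```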

Just a thought. I hope it makes at least a little sense.

Best,
Arun


Re: Follow-up on subsetting data.table with NAs

Frank Erickson
In reply to this post by Matthew Dowle
I prefer ~ and/or NJ() over -. The not-join operation is different from the subsetting operation usually associated with -.

I don't know what characters are available for this sort of thing, but @x, @(x,y) seems natural enough as syntax for a getter.



Re: Follow-up on subsetting data.table with NAs

eddi
Btw, since we're on the topic of join/not-join syntax, does this break others' expectations or is it just me?

> dt = data.table(x = c(1,2,3))
> setkey(dt,x)
> dt[J(1)]
   x
1: 1
> dt[!J(1)]
   x
1: 2
2: 3
> dt[(!J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"
> dt[(J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"

I understand why this happens internally: the function "(" is read as the head of the expression tree, so J is no longer the first thing seen. But it's still pretty weird.
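The expression-tree point can be inspected with base R's quote(), without evaluating anything. The helper head_of is a made-up name, and the sketch only shows what the parser produces; how `[.data.table` dispatches on that head is an assumption based on the behaviour shown above.

```r
# Return the name of the function at the head of an unevaluated call.
head_of <- function(expr) as.character(expr[[1]])

h_join       <- head_of(quote(J(1)))      # "J"
h_notjoin    <- head_of(quote(!J(1)))     # "!"
h_paren_join <- head_of(quote((J(1))))    # "("
h_paren_not  <- head_of(quote((!J(1))))   # "("

print(c(h_join, h_notjoin, h_paren_join, h_paren_not))
```

Once the call is wrapped in parentheses, the head is "(" in both cases, so code that looks only at the head for J or ! would not find either.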



Re: Follow-up on subsetting data.table with NAs

Matthew Dowle
In reply to this post by Arunkumar Srinivasan

 

Hi Arun,

Indeed.  ! was introduced for not-join i.e. X[!Y] where i is type data.table.  Extending it to vectors seemed to make sense at the time; e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where X[-(3:6)] was intended) were in my mind.   I think of everything as a join really; e.g., "where rownumber = i".
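The -3:6 mistake mentioned above comes from operator precedence in base R: unary minus binds tighter than the colon operator. A small sketch on a plain character vector (v is a made-up name):

```r
a <- -3:6     # parsed as (-3):6, i.e. -3, -2, ..., 6
b <- -(3:6)   # -3, -4, -5, -6

v <- letters[1:10]

# Dropping elements 3 to 6 needs the parenthesised form.
kept <- v[b]
print(kept)

# The unparenthesised form mixes negative and positive subscripts and fails.
mixed_fails <- inherits(try(v[a], silent = TRUE), "try-error")
print(mixed_fails)
```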

But I think I'm fine with ! being not-join for data.table/list i only.  Or should it be turned off for logical vector i only, leaving ! as-is for character and integer vector i?

Matthew

 

On 10.06.2013 15:52, Arunkumar Srinivasan wrote:

Matthew, 
It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in:
dt <- data.table(x=c(0,0,1,1,3), y=1:5)
setkey(dt, "x")
dt[J(c(1,3))] # join
   x y
1: 1 3
2: 1 4
3: 3 5
dt[!J(c(1,3))]
   x y
1: 0 1
2: 0 2
Here the concept of "Not Join" with the use of "!J(.)" makes total sense. However, extending it to not-join for logical vectors is what seems to be an issue. It's more of a logical indexing than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking if `i` is a data.table or not) to tell if it's a subsetting by logical vector or subsetting by "data.table" and then deciding what to do, would that resolve this issue? If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered as a not-join?? 
Just a thought. I hope it makes at least a little sense.
Best,
Arun

On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote:

Hm, another good point. We need ~ for formulae, although I can't
imagine a formula in i (only in j). But in both i and j we might want
to get(x).
I thought about ^ i.e. X[^Y] in the spirit of regular expression
syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a
prefix.
- maybe then? Consistent with - meaning in R. I don't think I
actually had a specific use in mind for - and +, to reserve them for,
but at the time it just seemed a shame to use up one of -/+ without
defining the other. If - does a not join, then, might + be more like
merge() (i.e. returning the union of the rows in x and i by join). I
think I had something like that in mind, but hadn't thought it through.
Some might say it should be a new argument e.g. notjoin=TRUE, but my
thinking there is readability, since we often have many lines in i, j
and by in that order, and if the "notjoin=TRUE" followed afterwards it
would be far away from the i argument to which it applies. If we
incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet
more parameters, too.
On 10.06.2013 15:02, Gabor Grothendieck wrote:
The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.
I can think of other things that ~ might be better used for. For
example, perhaps ~ x could mean get(x). One aspect of data.table
that
tends to be difficult is when you don't know the variable name ahead
of time and this woiuld give a way to specify it concisely.
On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
Matthew,
How about ~ instead of ! ? I ruled out - previously to leave +
and -
available for future use. NJ() may be possible too.
Both "NJ()" and "~" are okay for me.
That result makes perfect sense to me. I don't think of !(x==.) being
the same as x!=. ! is simply a prefix. It's all the rows that aren't
returned if the ! prefix wasn't there.
I understand that `DT[!(x)]` does what `data.table` is designed to do
currently. What I failed to mention was that if one were to consider
implementing `!(x==.)` as the same as `x != .` then this behaviour has
to be changed. Let's forget this point for a moment.
That needs to be fixed. But we're getting quite theoretical here and
far away from common use cases. Why would we ever have row numbers of
the table, as a column of the table itself and want to select the rows
by number not mentioned in that column?
Probably I did not choose a good example. Suppose that I've a
data.table and I want to get all rows where "x == 0". Let's say:
set.seed(45)
DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(15))
DF <- as.data.frame(DT)
To get all rows where x == 0, it could be done with DT[x == 0]. But it
makes sense, at least in the context of data.frames, to do equivalently,
DF[!(DF$x), ] (or) DF[DF$x == 0, ]
All I want to say is, I expect `DT[!(x)]` should give the same result
as `DT[x == 0]` (even though I fully understand it's not the intended
behaviour of data.table), as it's more intuitive and less confusing.
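A small base R sketch of the data.frame equivalence described above. The data frame is made up for the example and has no NAs; it covers only the data.frame side, not data.table's `!` prefix.

```r
# Base R only; a hypothetical data frame with a numeric column x.
DF <- data.frame(x = c(0, 5, 10, 0, 15), y = 1:5)

# `!` coerces numeric to logical first: 0 becomes FALSE, non-zero TRUE,
# so !(DF$x) is TRUE exactly where x == 0
a <- DF[!(DF$x), ]
b <- DF[DF$x == 0, ]

print(identical(a, b))
```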
So, changing `!` to `~` or `NJ` is one half of the issue for me. The
other is to replace the actual function of `!` in all contexts. I hope
I came across with what I wanted to say, better this time.
Best,
Arun
On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
Hi,
How about ~ instead of ! ? I ruled out - previously to leave + and -
available for future use. NJ() may be possible too.
Matthew
On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments
here:
Pasted here for convenience:
data.table is mimicking subset in its handling of NA values in logical
i arguments. -- the only issue is the ! prefix signifying a not-join,
not the way one might expect. Perhaps the not join prefix could have
been NJ not ! to avoid this confusion -- this might be another
discussion to have on the mailing list -- (I think it is a discussion
worth having)
Arun
On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
Hm, good point. Is data.table consistent with SQL already, for both ==
and !=, and so no change needed?
Yes, I believe it's already consistent with SQL. However, the current
interpretation (in the documentation) of NA being treated as FALSE is
not needed / untrue, imho (please see below).
And it was correct for Frank to be mistaken.
Yes, it seems like he was mistaken.
Maybe just some more documentation and examples needed then.
It'd be much more appropriate if the documentation reflected that
subsetting in data.table mimics the "subset" function (in order to be
in line with SQL) by dropping rows where the logical expression
evaluates to NA. In the code I pasted a couple of posts back, replacing
NAs with FALSE is not necessary: `irows <- which(i)` makes clear that
`which` is used to get the indices before subsetting, and that fits
perfectly well with the interpretation of NA in data.table.
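The point about `which` can be checked in base R. This is a minimal sketch on a made-up data frame: it does not run `[.data.table` itself, it only shows that `which()` already drops NA, so setting NA to FALSE first gives the same row numbers.

```r
# Base R only; DF is a hypothetical data frame containing NAs.
DF <- data.frame(V1 = c(1, NA, 3, 2), V2 = c(1, 2, NA, 3))
i <- DF$V1 == DF$V2                  # TRUE NA NA FALSE

raw <- DF[i, ]                       # NA in i yields all-NA rows
via_which <- DF[which(i), ]          # which() keeps only the TRUE positions

i2 <- i
i2[is.na(i2)] <- FALSE               # the NA -> FALSE step under discussion
via_false <- DF[which(i2), ]

sub <- subset(DF, V1 == V2)          # subset() also drops NA rows

print(nrow(raw))
print(identical(via_which, via_false))
```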
Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA
inconsistently? :
Ha, I like the idea behind the use of () in evaluating expressions.
It's another nice layer towards simplicity in data.table. But I still
think equivalent logical operations should not give different results.
If !(x == .) and x != . are indeed different, then I'd suggest
replacing `!` with a more appropriate name, as it's much too easy to
get confused otherwise.
In essence, either !(x == .) must evaluate to (x != .) if the
underlying meaning of these is the same, or the `!` in `!(x==.)` must
be replaced with something more appropriate for what it's supposed to
do. Personally, I prefer the former. It would greatly tighten the
structure and consistency.
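To make the two readings concrete, here is a base R sketch on a made-up vector. The "complement" line models the not-join reading quoted in this thread (all rows not returned without the prefix); it is an illustration, not data.table's actual code.

```r
# Base R only; x is a hypothetical column containing an NA.
x <- c(1, 2, NA, 1)

# Logical reading: negation and != agree, and NA stays NA
neg <- !(x == 1)
neq <- x != 1
print(identical(neg, neq))

# After dropping NA (as which() or subset() would), only row 2 remains
rows_logical <- which(neg)

# Not-join reading: every row that x == 1 did not return,
# which also keeps the NA row
rows_complement <- setdiff(seq_along(x), which(x == 1))

print(rows_logical)
print(rows_complement)
```

The two row sets differ only in the NA row, which is the inconsistency being discussed.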
"na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch
before in the context of joins, not logical subsets.
Yes, I find this option would give more control in evaluating
expressions in `i` with ease, by providing both "subset"-style
behaviour (the default) and the typical data.frame subsetting
(na.rm = FALSE).
Best regards,
Arun
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help