the difference between "-" and "!" between base and data.table package


the difference between "-" and "!" between base and data.table package

R help mailing list-2
Hi


I normally use the data.table package, but today I was doing some base R coding.  I had a problem for a while, which I finally resolved.  I was attempting to split a data frame into training and test sets, and in base R I was using "!" to exclude the training-set indices from the data frame.  All I got was zero observations.  I changed to "-" and it worked.  I recalled that in data.table the "!" operator worked, so I created this little bit of code.

#  Base R Functions
str(mtcars)
train_indices <- sample(nrow(mtcars), round(0.75*nrow(mtcars)))
train <- mtcars[train_indices,]
mode(train_indices); class(train_indices)
test <- mtcars[!train_indices,]  #  the "!" function returning 0 observations
test_1 <- mtcars[-train_indices,]
identical(test, test_1)

#  Using data.table package
library(data.table)
dt1 <- data.table(mtcars)
train_indices <- sample(nrow(dt1), round(0.75*nrow(dt1)))
train <- dt1[train_indices,]
mode(train_indices); class(train_indices)
test <- dt1[!train_indices,]  #  the "!" function
test_1 <- dt1[-train_indices,]
identical(test, test_1)
The documentation appears to me to accept "!" in base, so do I have some kind of ridiculous error or ..??
Carl Sutton

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: the difference between "-" and "!" between base and data.table package

Jeff Newmiller
! is a logical operator... it means "not". When you write

lidx <- seq_along( mtcars[[ 1 ]] ) %in% train_indices

you end up with a vector of logical values for which ! makes sense. Since R supports logical indexing this can be a very convenient way to select one group or the other.

If you give an integer to the ! operator, any non-zero value is treated as TRUE, which can be useful sometimes but not in this case, since all of the train_indices are greater than zero. Look at what !train_indices actually is.
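
A minimal sketch of what happens in that case, using base R only. The seed and the names `neg` and `test` are assumptions made for illustration; they are not from the original post.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
train_indices <- sample(nrow(mtcars), round(0.75 * nrow(mtcars)))

# Every index is a positive integer, so each one is coerced to TRUE
# and then negated to FALSE.
neg <- !train_indices
length(neg)                       # 24
any(neg)                          # FALSE

# An all-FALSE logical index selects no rows at all.
test <- mtcars[neg, ]
nrow(test)                        # 0
```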

As the "An Introduction to R" manual says, integer indexing always starts at 1 rather than 0 as in many other languages. This makes it feasible to let negative integer indexes mean "exclude those positions". Thus the following returns TRUE:

identical( mtcars[ !lidx, ], mtcars[ -train_indices, ] )
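
The equivalence can be checked end to end with a short sketch. The seed, the variable names, and the extra `setdiff()` variant are assumptions added for illustration.

```r
set.seed(42)                                  # arbitrary seed
n <- nrow(mtcars)
train_indices <- sample(n, round(0.75 * n))

# Logical vector: TRUE for rows that belong to the training set
lidx <- seq_len(n) %in% train_indices

test_logical  <- mtcars[!lidx, ]              # negate a logical index
test_negative <- mtcars[-train_indices, ]     # exclude by negative integers
test_setdiff  <- mtcars[setdiff(seq_len(n), train_indices), ]  # keep the remaining positions

same_neg <- identical(test_logical, test_negative)
same_set <- identical(test_logical, test_setdiff)
nrow(test_logical)                            # 8
```

All three forms select the same eight held-out rows in their original order; which one to use is mostly a matter of whether the index at hand is logical or integer.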

The "An Introduction to R" manual is really quite informative to re-read occasionally. For example, look up indexing with a matrix as the index.
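
For readers who have not met matrix indexing, a small sketch follows. The matrix `m` and the index `idx` are made up for illustration.

```r
m <- matrix(1:12, nrow = 3)       # filled column by column

# A two-column matrix index: each row is one (row, column) pair
idx <- cbind(c(1, 3), c(2, 4))
picked <- m[idx]                  # m[1, 2] and m[3, 4]: 4 12

# A logical matrix of the same shape also works as an index
big <- m[m > 10]                  # 11 12
```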
--
Sent from my phone. Please excuse my brevity.


Re: the difference between "-" and "!" between base and data.table package

David Winsemius

> On Apr 15, 2017, at 5:18 PM, Carl Sutton via R-help <[hidden email]> wrote:
>
> Hi
>
>
> I normally use package data.table but today was doing some base R coding.  Had a problem for a bit which I finally resolved.  I was attempting to separate a data frame between train and test sets, and in base R was using the "!" to exclude training set indices from the data frame.  All I was getting was zero observations.  Changed to using "-" and it worked.  I recalled that in data.table the "!" function worked, so created this little bit of code.
>
> #  Base R Functions
> str(mtcars)
> train_indices <- sample(nrow(mtcars), round(0.75*nrow(mtcars)))
> train <- mtcars[train_indices,]
> mode(train_indices); class(train_indices)
> test <- mtcars[!train_indices,]  #  the "!" function returning 0 observations

The arguments you are supplying:

> table( !train_indices )

FALSE
   24


> test_1 <- mtcars[-train_indices,]
> identical(test, test_1)
>
> #  Using data.table package
> library(data.table)
> dt1 <- data.table(mtcars)
> train_indices <- sample(nrow(dt1), round(0.75*nrow(dt1)))
> train <- dt1[train_indices,]

The data.table "[" function has very different syntax and evaluation rules than does the data.frame "[" function, but I guess you know that.


> mode(train_indices); class(train_indices)
> test <- dt1[!train_indices,]  #  the "!" function
> test_1 <- dt1[-train_indices,]
> identical(test, test_1)
> The documentation appears to me to accept "!" in base, so do I have some kind of ridiculous error or ..??

Not sure about "ridiculous", and you have not actually said what it was that _you_ were questioning.
If it is the lack of any rows returned from `test <- mtcars[!train_indices,]`, then it could be argued that was a ridiculous expectation, at least according to the rules of vector evaluation in row selection that I thought I understood. Giving a vector of FALSE values to `[.data.frame` would not reasonably be expected to return anything. Giving a vector of only FALSE values to `[.data.table` and actually getting something back does seem kind of unexpected to me, but clearly it didn't seem ridiculous to Matt Dowle. Clearly the recycling rules for `[.data.table` are different from those of `[.data.frame`. Data.tables don't use rownames.


The results from:

> dt1[rep(FALSE,24), ]
Error in `[.data.table`(dt1, rep(FALSE, 24), ) :
  i evaluates to a logical vector length 24 but there are 32 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

... is different from the result of

dt1[!train_indices, ] # get 8 rows.

To me that doesn't make sense.
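
One plausible explanation for the 8-row result: the data.table documentation for the `i` argument says that `i` may be prefixed with `!` to request a not-join or not-select. That suggests `[.data.table` looks at the unevaluated expression and never receives an already-negated all-FALSE vector. The sketch below imitates the idea in base R with a hypothetical helper, `select_rows()`. It illustrates the mechanism under that assumption and is not data.table's actual code.

```r
# Hypothetical helper for illustration; not part of base R or data.table.
select_rows <- function(df, i) {
  expr <- substitute(i)           # capture the expression without evaluating it
  if (is.call(expr) && identical(expr[[1L]], as.name("!"))) {
    # Leading "!": evaluate only what follows it, then exclude those rows
    idx <- eval(expr[[2L]], parent.frame())
    if (is.logical(idx)) df[!idx, , drop = FALSE] else df[-idx, , drop = FALSE]
  } else {
    df[eval(expr, parent.frame()), , drop = FALSE]
  }
}

set.seed(1)                       # arbitrary seed
train_indices <- sample(nrow(mtcars), 24)

n_helper <- nrow(select_rows(mtcars, !train_indices))  # 8: "!" read as "exclude"
n_base   <- nrow(mtcars[!train_indices, ])             # 0: "!" evaluated first
```

In base R the index expression is evaluated before `[.data.frame` sees it, so the negation has already produced FALSE values; a function that captures the expression can choose to interpret `!` differently.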


I generally use %in% for row selection. But many people would also find this pair of results "ridiculous":

> mtcars[ which( train_indices %in% 50:100), ]
 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)
> mtcars[ -which( train_indices %in% 50:100), ]   # bad idea to use minus before which()
 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)

Yes, I know that some people think the `which` is not needed. I'm not one of them.
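
The reason the second call returns nothing can be shown on a plain vector. This is a sketch with made-up values; `x` and `hits` are illustrative names.

```r
x <- c(10, 20, 30)
hits <- which(x > 100)            # integer(0): nothing matches

# Negating an empty index is still an empty index, so nothing is selected,
# although the intent was to exclude nothing and keep everything.
dropped <- x[-hits]               # numeric(0)

# Negating the logical condition keeps every element, as intended.
kept_logical <- x[!(x > 100)]     # 10 20 30

# With integer positions, setdiff() avoids the empty-index trap.
kept_setdiff <- x[setdiff(seq_along(x), hits)]   # 10 20 30
```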

--
David.



David Winsemius
Alameda, CA, USA


Re: the difference between "-" and "!" between base and data.table package

R help mailing list-2
Hi


Thank you all for your input.  But I must apologize.  When I was searching the help page, I read this far and stopped:
Logic {base} R Documentation
Logical Operators

Description

These operators act on raw, logical and number-like vectors.

Usage

! x
x & y
x && y
x | y
x || y
xor(x, y)

isTRUE(x)
Arguments

x, y
raw or logical or ‘number-like’ vectors (i.e., of types double (class numeric), integer and complex), or objects for which methods have been written.

At that point I took a look at train_indices and it was indeed a vector of integers, so directed my inquiry to the list.

After reading the answers it was fairly obvious I had missed something, and in the details is
   
    Numeric and complex vectors will be coerced to logical values, with zero being false and all non-zero values being true. Raw vectors are handled without any coercion for !, &, | and xor, with these operators being applied bitwise (so ! is the 1s-complement).

I was truly lax in my search of the documentation.
Again, thank you for your time and expertise. I will try to be more thorough in my research in the future.
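
The coercion rule quoted from the Details section can be tried directly. This is a sketch with arbitrary example values.

```r
# Numeric input: zero becomes FALSE, anything non-zero becomes TRUE,
# and "!" then flips the result.
num_res <- !c(0, 1, -2, 0.5)      # TRUE FALSE FALSE FALSE
int_res <- !c(0L, 3L)             # TRUE FALSE

# Raw input is not coerced; "!" is applied bitwise (1s-complement).
raw_res <- !as.raw(1)             # fe

# Row indices are always >= 1, so negating them can never yield TRUE.
idx_res <- !(1:5)                 # all FALSE
```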
 
Carl Sutton

