Odd behaviour of mean() with a numeric column in a tibble


Odd behaviour of mean() with a numeric column in a tibble

Chris Evans
I hope I am obeying the list rules here. I am using a raw R IDE for this and running R 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit).

Here is a reproducible example, code only first:

require(tibble)
tmpTibble <- tibble(ID=letters,num=1:26)
min(tmpTibble[,2]) # fine
max(tmpTibble[,2]) # fine
median(tmpTibble[,2])  # not fine
mean(tmpTibble[,2])    # not fine
newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
newMedianFun(tmpTibble[,2]) # ditto
str(tmpTibble[,2])

### then I tried this to make sure it wasn't about having fed in integers

tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
tmpTibble2
mean(tmpTibble2[,3]) # not fine, not about integers!


### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
tmpTibble$newNum <- tmpTibble[,2]/10  # NO!
tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
### and oddly enough ...
add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!

Now here it is with the output:

> require(tibble)
Loading required package: tibble
> tmpTibble <- tibble(ID=letters,num=1:26)
> min(tmpTibble[,2]) # fine
[1] 1
> max(tmpTibble[,2]) # fine
[1] 26
> median(tmpTibble[,2])  # not fine
Error in median.default(tmpTibble[, 2]) : need numeric data
> mean(tmpTibble[,2])    # not fine
[1] NA
Warning message:
In mean.default(tmpTibble[, 2]) :
  argument is not numeric or logical: returning NA
> newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
> newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
[1] 13.5
> newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
> newMedianFun(tmpTibble[,2]) # ditto
[1] 13.5
> str(tmpTibble[,2])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       26 obs. of  1 variable:
 $ num: int  1 2 3 4 5 6 7 8 9 10 ...
>
> ### then I tried this to make sure it wasn't about having fed in integers
>
> tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
> tmpTibble2
# A tibble: 26 × 3
      ID   num  num2
   <chr> <int> <dbl>
1      a     1   0.1
2      b     2   0.2
3      c     3   0.3
4      d     4   0.4
5      e     5   0.5
6      f     6   0.6
7      g     7   0.7
8      h     8   0.8
9      i     9   0.9
10     j    10   1.0
# ... with 16 more rows
> mean(tmpTibble2[,3]) # not fine, not about integers!
[1] NA
Warning message:
In mean.default(tmpTibble2[, 3]) :
  argument is not numeric or logical: returning NA
>
>
> ### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
> tmpTibble$newNum <- tmpTibble[,2]/10  # NO!
> tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
> ### and oddly enough ...
> add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
Error: Each variable must be a 1d atomic vector or list.
Problem variables: 'newNum'
>
>

I discovered this when I hit odd behaviour after using read_spss() from the haven package for the first time; it seemed to offer a step forward over good old read.spss() from the excellent foreign package. I am reporting it here rather than directly to Prof. Wickham because the issues seem rather general, though I'm guessing that it needs a fix in tibble. Or perhaps I've completely missed something.

TIA,

Chris

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Odd behaviour of mean() with a numeric column in a tibble

Ista Zahn
Not at a computer to check right now, but I believe single-bracket indexing
of a tibble always returns a tibble. To extract a vector, use [[.
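
To make the distinction concrete, here is a minimal sketch, assuming the tibble package is installed (the names one_col and num_vec are made up for illustration). It also suggests why min() and max() worked in the original post while mean() did not: in base R, min() and max() go through the Summary group generic, which has a data.frame method, whereas mean() has no data.frame method and ends up in mean.default().

```r
library(tibble)

tmpTibble <- tibble(ID = letters, num = 1:26)

# Single-bracket indexing keeps the tibble wrapper, even for one column.
one_col <- tmpTibble[, 2]
is.data.frame(one_col)            # TRUE

# Double-bracket indexing pulls out the underlying vector.
num_vec <- tmpTibble[[2]]
is.data.frame(num_vec)            # FALSE
is.integer(num_vec)               # TRUE

# min() and max() accept the one-column tibble via the data.frame
# method of the Summary group generic.
min(one_col)
max(one_col)

# mean() sees a list-like object that is neither numeric nor logical,
# so mean.default() warns and returns NA.
suppressWarnings(mean(one_col))
```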

On Dec 6, 2016 4:28 PM, "Chris Evans" <[hidden email]> wrote:
>
> I hope I am obeying the list rules here. I am using a raw R IDE for this and running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)
>
> Here is a reproducible example.  Code only first
>
> require(tibble)
> tmpTibble <- tibble(ID=letters,num=1:26)
> min(tmpTibble[,2]) # fine
> max(tmpTibble[,2]) # fine
> median(tmpTibble[,2])  # not fine
> mean(tmpTibble[,2])    # not fine

I think you want

mean(tmpTibble[[2]])
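
Spelled out as a sketch (again assuming tibble is installed; newNum and newNum2 are illustrative column names), the same idea also covers the column-adding attempts from the original post: extract the vector with [[ or $ first, then compute on it.

```r
library(tibble)

tmpTibble <- tibble(ID = letters, num = 1:26)

# Extract the column as a vector, by position or by name.
mean(tmpTibble[[2]])
median(tmpTibble$num)

# Build the new column from the vector, not from a one-column tibble.
tmpTibble$newNum <- tmpTibble[[2]] / 10
tmpTibble <- add_column(tmpTibble, newNum2 = tmpTibble$num / 10)

# Both new columns are plain numeric vectors.
str(tmpTibble$newNum)
str(tmpTibble$newNum2)
```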


Re: Odd behaviour of mean() with a numeric column in a tibble

Chris Evans
{{SIGH}}

You are absolutely right.

I wonder if I am losing some cognitive capacities that are needed to be part of the evolving R community. It seems to me that if a tibble is designed to be an enhanced replacement for a dataframe then it shouldn't quite so radically change things.

I notice that the documentation on tibble says "[ Never simplifies (drops), so always returns data.frame"
That is much less explicit than I would have liked and actually doesn't seem to be true. In fact, as you rightly say, it generally, but not quite always, returns a tibble. In fact it can be fooled into a vector of length 1.

> tmpTibble[[1,]]
Error in `[[.data.frame`(tmpTibble, 1, ) :
argument "..2" is missing, with no default

> tmpTibble[1]
# A tibble: 26 × 1
      ID
   <chr>
1      a
2      b
3      c
4      d
5      e
6      f
7      g
8      h
9      i
10     j
# ... with 16 more rows
> tmpTibble[,1]
# A tibble: 26 × 1
      ID
   <chr>
1      a
2      b
3      c
4      d
5      e
6      f
7      g
8      h
9      i
10     j
# ... with 16 more rows
> tmpTibble[1,]
Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
replacement element 3 is a matrix/data frame of 26 rows, need 1
In addition: Warning messages:
1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
replacement element 1 has 26 rows to replace 1 rows
2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
replacement element 2 has 26 rows to replace 1 rows
> tmpTibble[1,1:26]
Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
> tmpTibble[[1,2]]
[1] 1
> str(tmpTibble[[1,2]])
int 1
> str(tmpTibble[[1:2,2]])
Error in col[[i, exact = exact]] :
attempt to select more than one element in vectorIndex
>
> tmpTibble[[1,1:2]]
[1] "b"
>

So [[a,b]] works if a and b are legal for the dimensions of the tibble and a is of length 1, but it returns NOT a tibble but a vector of length 1 (I think). I can see that's logical, but it is not what it says in the documentation.

[[a]] and [[,a]] return the same result; that seems excessively tolerant to me.

[[a,b:c]] actually returns [[a,c]], again as a single value, NOT a tibble.

And row subsetting/indexing has gone.

Why create a replacement for a data frame that has no row indexing and so radically redefines column indexing, in fact redefines the whole of indexing and subsetting?

OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. Perhaps Prof. Wickham or someone can spell out a bit less tersely, and I think incompletely, than the tibble documentation does, why all this is good.

Thanks anyway Ista, you certainly hit the issue!

Very best all,

Chris


Re: Odd behaviour of mean() with a numeric column in a tibble

Jeff Newmiller
You really need sleep. Then you need to read

?`[[`

and in particular read about the second argument to the `[[` function, since you don't seem to understand what it is for. Maybe reread the "An Introduction to R" manual that comes with R.

The simplest solution is to treat `[[` as supporting one index and `[` as supporting either one or two.

As for expecting any form of row indexing of data frames or tibbles to return a vector, that is hopeless because each column can have a different type.  dta[ 1, ] returns exactly what it has to return to avoid losing fidelity. If you really need row indexing to return a vector you should be using a matrix.
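
A small base R sketch of both points, with made-up objects df and m: one index for `[[`, one or two for `[`, and a row that stays a data frame unless every column shares a type.

```r
df <- data.frame(ID = letters[1:3], num = 1:3, stringsAsFactors = FALSE)

# `[[` with one index: a single column as a plain vector.
df[["num"]]

# `[` with one index selects columns and keeps the data frame.
df["num"]

# `[` with two indices selects rows, then columns. One row is still a
# data frame, because ID is character and num is integer.
df[2, ]

# A matrix holds a single type, so one of its rows can be a plain vector.
m <- cbind(a = 1:3, b = 4:6)
m[2, ]
```
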
--
Sent from my phone. Please excuse my brevity.


Re: Odd behaviour of mean() with a numeric column in a tibble

Ista Zahn
On Tue, Dec 6, 2016 at 5:10 PM, Chris Evans <[hidden email]> wrote:
> {{SIGH}}
>
> You are absolutely right.
>
> I wonder if I am losing some cognitive capacities that are needed to be part of the evolving R community. It seems to me that if a tibble is designed to be an enhanced replacement for a dataframe then it shouldn't quite so radically change things.

Well, there are some things about data frames that are darn annoying,
and tibbles exist partly as an attempt to eliminate some of the
inconsistencies with data.frames. That necessarily means changing
things.

>
> I notice that the documentation on tibble says "[ Never simplifies (drops), so always returns data.frame"
> That is much less explicit than I would have liked and actually doesn't seem to be true. In fact, as you rightly say, it generally, but not quite always, returns a tibble. In fact it can be fooled into a vector of length 1.

Really? How?

>
>> tmpTibble[[1,]]
> Error in `[[.data.frame`(tmpTibble, 1, ) :
> argument "..2" is missing, with no default

That doesn't have anything to do with tibbles:

as.data.frame(tmpTibble)[[1, ]]

gives the same thing.

>
>> tmpTibble[1]
> # A tibble: 26 × 1
> ID
> <chr>
> 1 a
> 2 b
> 3 c
> 4 d
> 5 e
> 6 f
> 7 g
> 8 h
> 9 i
> 10 j
> # ... with 16 more rows

Again, just what you expect from a data.frame (except for the print method).

>> tmpTibble[,1]
> # A tibble: 26 × 1
> ID
> <chr>
> 1 a
> 2 b
> 3 c
> 4 d
> 5 e
> 6 f
> 7 g
> 8 h
> 9 i
> 10 j
> # ... with 16 more rows

That is different, and by design as you noted. It is different from
data.frame indexing, but the data.frame behavior is needlessly
complicated. Sometimes you get a vector, sometimes a data.frame. That
hardly seems worth it given that we already have $ or [[ if you really
wanted a vector.
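
The inconsistency being described can be seen with a plain data.frame, no tibble needed. This is only a sketch; the data frame df and the helper pick_cols() are invented for illustration.

```r
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

# Hypothetical helper: select columns by position.
pick_cols <- function(d, cols) d[, cols]

is.data.frame(pick_cols(df, 1:2))     # TRUE
is.data.frame(pick_cols(df, 2))       # FALSE: silently simplified to a vector

# drop = FALSE switches the simplification off, which matches what a
# tibble does for `[` by default.
is.data.frame(df[, 2, drop = FALSE])  # TRUE
```

Code that receives a variable number of column indices therefore has to remember drop = FALSE with a data.frame, while with a tibble the return type of `[` does not depend on how many columns were asked for.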

>> tmpTibble[1,]
> Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 3 is a matrix/data frame of 26 rows, need 1
> In addition: Warning messages:
> 1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 1 has 26 rows to replace 1 rows
> 2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 2 has 26 rows to replace 1 rows

That's not what I get.

> tmpTibble[1,]
# A tibble: 1 × 2
    ID   num
 <chr> <int>
1     a     1

works just as I would expect here.
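
For anyone wanting to check this themselves, a sketch on a freshly built tibble (assuming tibble is installed; fresh and first_row are illustrative names). One possible explanation for the error in the quoted transcript, which is an assumption and not confirmed anywhere in this thread, is that tmpTibble had already been given a third column by the earlier `tmpTibble$newNum <- tmpTibble[,2]/10`: the error refers to "replacement element 3", and the invalid column indexes reported in the next error start at 4.

```r
library(tibble)

# Start from a clean two-column tibble, with no newNum column attached.
fresh <- tibble(ID = letters, num = 1:26)

# Row indexing with `[` returns a one-row tibble.
first_row <- fresh[1, ]
is.data.frame(first_row)
dim(first_row)
first_row
```
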
>> tmpTibble[1,1:26]
> Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

Other than providing more information about what went wrong this is
the same as data.frame:

> as.data.frame(tmpTibble)[1,1:26]
Error in `[.data.frame`(as.data.frame(tmpTibble), 1, 1:26) :
 undefined columns selected

>> tmpTibble[[1,2]]
> [1] 1

Same as data.frame (and not at odds with the documentation, which
says that [ (not [[ ) always returns a data.frame).
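
For reference, a sketch of the two-index form of [[ on a plain data.frame (df and cell are illustrative names). It picks a single cell, row first and column second, which is why the result is a length-one vector rather than a data frame.

```r
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

# Row 1 of column 2.
cell <- df[[1, 2]]
cell

# The same element reached by extracting the column vector first.
identical(cell, df[[2]][1])
```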

>> str(tmpTibble[[1,2]])
> int 1
>> str(tmpTibble[[1:2,2]])
> Error in col[[i, exact = exact]] :
> attempt to select more than one element in vectorIndex

Same behavior as data.frame.

>>
>> tmpTibble[[1,1:2]]
> [1] "b"
>>

Same behavior as data.frame.
>
> So [[a,b]] works if a and b are legal for the dimensions of the tibble and a is of length 1, but it returns NOT a tibble but a vector of length 1 (I think). I can see that's logical, but it is not what it says in the documentation.

In what documentation? The documentation that says [ always returns a
data.frame? Note that [ and [[ are not the same, and only [ is
documented to always return a data.frame.
>
> [[a]] and [[,a]] return the same result; that seems excessively tolerant to me.

Not for me:

> tmpTibble[[1]]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> tmpTibble[[, 1]]
Error in `[[.data.frame`(tmpTibble, , 1) :
 argument "..1" is missing, with no default

(this is the same thing that happens with a data.frame)
>
> [[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a tibble.

That is weird, but no different from data.frame. See above regarding
"NOT a tibble".

>
> And row subsetting/indexing has gone.

Whatever do you mean?

> tmpTibble[tmpTibble$ID == "d", ]
# A tibble: 1 × 2
    ID   num
 <chr> <int>
1     d     4

>
> Why create a replacement for a data frame that has no row indexing and so radically redefines column indexing, in fact redefines the whole of indexing and subsetting?

It has row indexing, and besides [, x] not dropping dimension it works
pretty much the same.
>
> OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. Perhaps Prof. Wickham or someone can spell out a bit less tersely, and I think incompletely, than the tibble documentation does, why all this is good.

Most of the things you identify here are issues inherited from
data.frame, and not due to differences between tibbles and
data.frames.

Best,
Ista

>
> Thanks anyway Ista, you certainly hit the issue!
>
> Very best all,
>
> Chris
>
>> From: "Ista Zahn" <[hidden email]>
>> To: "Chris Evans" <[hidden email]>
>> Cc: "r-helpr-project.org" <[hidden email]>
>> Sent: Tuesday, 6 December, 2016 21:40:41
>> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble
>
>> Not at a computer to check right now, but I believe single bracket indexing a
>> tibble always returns a tibble. To extract a vector use [[
>
>> On Dec 6, 2016 4:28 PM, "Chris Evans" < [hidden email] > wrote:
>
>>> I hope I am obeying the list rules here. I am using a raw R IDE for this and
>> > running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)
>
>> > Here is a reproducible example. Code only first
>
>> > require(tibble)
>> > tmpTibble <- tibble(ID=letters,num=1:26)
>> > min(tmpTibble[,2]) # fine
>> > max(tmpTibble[,2]) # fine
>> > median(tmpTibble[,2]) # not fine
>> > mean(tmpTibble[,2]) # not fine
>
>> I think you want
>
>> mean(tmpTibble[[2]]
>
>> > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>> > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>> > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>> > newMedianFun(tmpTibble[,2]) # ditto
>> > str(tmpTibble[,2])
>
>> > ### then I tried this to make sure it wasn't about having fed in integers
>
>> > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>> > tmpTibble2
>> > mean(tmpTibble2[,3]) # not fine, not about integers!
>
>
>>> ### before I just created tmpTibble2 I found myself trying to add a column to
>> > tmpTibble
>> > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>> > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>> > ### and oddly enough ...
>> > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>
>> > Now here it is with the output:
>
>> > > require(tibble)
>> > Loading required package: tibble
>> > > tmpTibble <- tibble(ID=letters,num=1:26)
>> > > min(tmpTibble[,2]) # fine
>> > [1] 1
>> > > max(tmpTibble[,2]) # fine
>> > [1] 26
>> > > median(tmpTibble[,2]) # not fine
>> > Error in median.default(tmpTibble[, 2]) : need numeric data
>> > > mean(tmpTibble[,2]) # not fine
>> > [1] NA
>> > Warning message:
>> > In mean.default(tmpTibble[, 2]) :
>> > argument is not numeric or logical: returning NA
>> > > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>> > > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>> > [1] 13.5
>> > > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>> > > newMedianFun(tmpTibble[,2]) # ditto
>> > [1] 13.5
>> > > str(tmpTibble[,2])
>> > Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 26 obs. of 1 variable:
>> > $ num: int 1 2 3 4 5 6 7 8 9 10 ...
>
>> > > ### then I tried this to make sure it wasn't about having fed in integers
>
>> > > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>> > > tmpTibble2
>> > # A tibble: 26 × 3
>> > ID num num2
>> > <chr> <int> <dbl>
>> > 1 a 1 0.1
>> > 2 b 2 0.2
>> > 3 c 3 0.3
>> > 4 d 4 0.4
>> > 5 e 5 0.5
>> > 6 f 6 0.6
>> > 7 g 7 0.7
>> > 8 h 8 0.8
>> > 9 i 9 0.9
>> > 10 j 10 1.0
>> > # ... with 16 more rows
>> > > mean(tmpTibble2[,3]) # not fine, not about integers!
>> > [1] NA
>> > Warning message:
>> > In mean.default(tmpTibble2[, 3]) :
>> > argument is not numeric or logical: returning NA
>
>
>>> > ### before I just created tmpTibble2 I found myself trying to add a column to
>> > > tmpTibble
>> > > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>> > > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>> > > ### and oddly enough ...
>> > > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>> > Error: Each variable must be a 1d atomic vector or list.
>> > Problem variables: 'newNum'
>
>
>
>>> I discovered this when I hit odd behaviour after using read_spss() from the
>>> haven package for the first time as it seemed to be offering a step forward
>>> over good old read.spss() from the excellent foreign package. I am reporting it
>>> here not directly to Prof. Wickham as the issues seem rather general though I'm
>>> guessing that it needs to be fixed with a fix to tibble. Or perhaps I've
>> > completely missed something.
>
>> > TIA,
>
>> > Chris
>
>> > ______________________________________________
>> > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>

Re: Odd behaviour of mean() with a numeric column in a tibble

Chris Evans
In reply to this post by Jeff Newmiller
Thanks to both Jeff and Ista for your inputs some days back.  I confess I was _indeed_ too tired to be thinking well and laterally, and even to be copying things into Emails successfully.

I have since had more sleep (!) and I have read ?`[[`, gone back to the pertinent parts of "Introduction to R" and generally pondered all this.  I confess I had always avoided [[ and only ever used it for lists that were not data frames.  I can now see just how badly I was misguessing its behaviour: apologies, I should have realised that I needed to go right back to basics.

I _can_ see that there are things in the behaviour of data frames that are not that obvious but I had become very used to them.  I can see value in converting to using tibbles instead of data frames and may try to do that.

However, I think the documentation for tibble would be improved for people like myself if it started with something that made it even clearer that tibbles are lists, just as data frames are, but that whereas a data frame has a single class(df) of "data.frame", class(tibble) is:
c("tbl_df","tbl","data.frame").
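The class difference described above can be checked directly. This is only a sketch, assuming the tibble package is installed; the object names df and tb are made up for illustration and do not come from the thread.

```r
library(tibble)

# Hypothetical example objects (names are illustrative only)
df <- data.frame(ID = letters, num = 1:26)
tb <- tibble(ID = letters, num = 1:26)

class(df)                  # "data.frame"
class(tb)                  # "tbl_df" "tbl" "data.frame"

# Both are lists underneath, and a tibble still counts as a data frame
is.list(df)                # TRUE
is.list(tb)                # TRUE
is.data.frame(tb)          # TRUE

# inherits() is one way to tell the two apart
inherits(df, "tbl_df")     # FALSE
inherits(tb, "tbl_df")     # TRUE
```

Because "data.frame" is still among the classes, is.data.frame() cannot separate the two; testing for "tbl_df" can.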

I can now see that what I get from ?tibble, i.e. "tibble is a trimmed down version of data.frame" is probably technically true though I'd describe it as a rationalised or even a beefed up version of data.frame.  I can also now see that what I find in https://cran.r-project.org/web/packages/tibble/tibble.pdf:

"[ Never simplifies (drops), so always returns data.frame"

is true, but only to the extent that any tibble is still a data.frame but with "data.frame" moved to the third position in the classes of the tibble where it would be the first and only class were it a pure data.frame.  I can also see now that that is not really inconsistent with what I get in https://github.com/tidyverse/tibble:

"Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!"

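The [ versus [[ distinction in that quotation is what tripped up mean() and median() in the original example. A minimal sketch, reusing tmpTibble from the first post and assuming the tibble package is installed (oneCol and numVec are names invented here):

```r
library(tibble)

tmpTibble <- tibble(ID = letters, num = 1:26)

oneCol <- tmpTibble[, 2]    # single bracket: still a one-column tibble
numVec <- tmpTibble[[2]]    # double bracket: the integer vector itself

is.data.frame(oneCol)       # TRUE
is.numeric(oneCol)          # FALSE, which is why mean() gave NA with a warning
is.numeric(numVec)          # TRUE

mean(numVec)                # 13.5
median(tmpTibble$num)       # 13.5, $ also extracts the vector

# Adding a derived column works once a vector, not a tibble, is supplied
tmpTibble$newNum <- tmpTibble[[2]] / 10
```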
However, I think it would be better if the tibble.pdf document said:

"[ Never simplifies (drops), so always returns tibble" even though "[ Never simplifies (drops), so always returns data.frame" is technically true, up to and including passing is.data.frame() as

Finally, I think I can see that if I want various functions I have written to work with tibbles (functions that worked fine on data frames, but which depended on indexing or subsetting those data frames using [,i] or sometimes [,i:j] to select vectors or matrices), then I will have to modify them so they test whether the input is a simple data frame or a data frame that is also a tibble.  I guess that I could have trapped things had my functions (where appropriate) had an is.numeric() input check ... and that I have to use an is.tibble() check, not an is.data.frame() check to distinguish the two!
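One possible shape for such an input check is sketched below in base R only. The helper names numericColumn and isTibble are hypothetical and not taken from any package; inherits(dat, "tbl_df") is used here as an assumption-free stand-in for a package-level tibble test.

```r
# Hypothetical helper: return one column as a plain vector, whatever kind
# of data frame comes in, and refuse non-numeric input.
numericColumn <- function(dat, j) {
  stopifnot(is.data.frame(dat), length(j) == 1)
  x <- dat[[j]]                      # [[ extracts the column itself
  if (!is.numeric(x)) {
    stop("column ", j, " is not numeric")
  }
  x
}

# Hypothetical helper: TRUE only for tibbles, since is.data.frame() is TRUE for both
isTibble <- function(dat) inherits(dat, "tbl_df")

df <- data.frame(ID = letters, num = 1:26)
mean(numericColumn(df, 2))           # 13.5
median(numericColumn(df, "num"))     # 13.5
isTibble(df)                         # FALSE
```

Since the extraction goes through [[, the same helper should behave identically when handed a tibble, so the class test is only needed where the two genuinely have to be treated differently.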

Ah well, even after years of part-time use of R, I guess it's been good for my soul and my deeper and wider understanding of R to go right back to the basics.

Thanks again to you both.  I am posting here to convey thanks and in case this is useful to anyone like myself who benefits from a bit more narrative than is usually offered by R definitions and help entries.

Chris



Re: Odd behaviour of mean() with a numeric column in a tibble

Ista Zahn
On Dec 10, 2016 4:59 PM, "Chris Evans" <[hidden email]> wrote:

> Thanks to both Jeff and Ista for your inputs some days back.  I confess I
> was _indeed_ too tired to be thinking well and laterally, and even to be
> copying things into Emails successfully.
>
> I have since had more sleep (!) and I have read ?`[[`, gone back to the
> pertinent parts of "Introduction to R" and generally pondered all this.  I
> confess I had always avoided [[ and only ever used it for lists that were
> not data frames.  I can now see just how badly I was misguessing its
> behaviour: apologies, I should have realised that I needed to go right back
> to basics.
>
> I _can_ see that there are things in the behaviour of data frames that are
> not that obvious but I had become very used to them.  I can see value in
> converting to using tibbles instead of data frames and may try to do that.
>
> However, I think the documentation for tibble would be improved for people
> like myself if it started with something that made it even clearer that
> tibbles are lists, just as data frames are, but that whereas a data frame
> has a single class(df) of "data.frame", class(tibble) is:
> c("tbl_df","tbl","data.frame").
>
> I can now see that what I get from ?tibble, i.e. "tibble is a trimmed down
> version of data.frame" is probably technically true though I'd describe it
> as a rationalised or even a beefed up version of data.frame.  I can also
> now see that what I find in
> https://cran.r-project.org/web/packages/tibble/tibble.pdf:
>
> "[ Never simplifies (drops), so always returns data.frame"
>
> is true, but only to the extent that any tibble is still a data.frame but
> with "data.frame" moved to the third position in the classes of the tibble
> where it would be the first and only class were it a pure data.frame.  I
> can also see now that that is not really inconsistent with what I get in
> https://github.com/tidyverse/tibble:
>
> "Tibbles also clearly delineate [ and [[: [ always returns another tibble,
> [[ always returns a vector. No more drop = FALSE!"
>
> However, I think it would be better if the tibble.pdf document said:
>
> "[ Never simplifies (drops), so always returns tibble" even though "[ Never
> simplifies (drops), so always returns data.frame" is technically true, up
> to and including passing is.data.frame() as
>
> Finally, I think I can see that if I want various functions I have written
> to work with tibbles (functions that worked fine on data frames, but which
> depended on indexing or subsetting those data frames using [,i] or
> sometimes [,i:j] to select vectors or matrices), then I will have to modify
> them so they test whether the input is a simple data frame or a data frame
> that is also a tibble.


Only if you relied on [.data.frame returning a vector for length-one j.
Just use [[ (or always pass a drop argument) for that case and your
indexing code will work the same on pure data.frames and tbl_dfs.
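As a small base-R sketch of that advice (the object names are invented for illustration), these are the forms whose result does not depend on the default dropping behaviour of [.data.frame:

```r
df <- data.frame(ID = letters, num = 1:26)

v1 <- df[[2]]                  # always the column as a vector
v2 <- df[, 2]                  # a vector here, but only because data.frame drops by default
d1 <- df[, 2, drop = FALSE]    # one-column data frame, the shape a tibble gives for [, 2]

identical(v1, v2)              # TRUE for a plain data.frame
is.data.frame(d1)              # TRUE
mean(df[[2]])                  # 13.5
```

Code written with df[[2]] for "give me the vector" and an explicit drop = FALSE for "give me a data frame" avoids relying on the default that differs between the two classes.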

--Ista

> I guess that I could have trapped things had my functions (where
> appropriate) had an is.numeric() input check ... and that I have to use an
> is.tibble() check, not an is.data.frame() check to distinguish the two!
>
> Ah well, even after years of part-time use of R, I guess it's been good for
> my soul and my deeper and wider understanding of R to go right back to the
> basics.
>
> Thanks again to you both.  I am posting here to convey thanks and in case
> this is useful to anyone like myself who benefits from a bit more narrative
> than is usually offered by R definitions and help entries.
>
> Chris

