sum() (and similar methods) should work for zero row data.frames


sum() (and similar methods) should work for zero row data.frames

Martin-4
The "Summary" group generics always throw errors for a data.frame with zero rows, for example:
> sum(data.frame(x = numeric(0)))
#> Error in FUN(X[[i]], ...) :
#>   only defined on a data frame with all numeric variables
The same happens for min, max, any, all, and so on. I believe this is inconsistent with what these methods do for other empty objects (vectors, matrices), where the return value is the identity element of the operation, so that results stay consistent when vectors are combined: sum(numeric(0)) == 0.
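
For reference, the empty-vector behaviour referred to above can be checked directly. This is only a small sketch in base R; the names `empty_results`, `x` and `empty` are made up for illustration.

```r
# Summary-group functions on empty vectors return the identity element
# of the corresponding operation.
empty_results <- list(
  sum  = sum(numeric(0)),
  prod = prod(numeric(0)),
  any  = any(logical(0)),
  all  = all(logical(0)),
  # min() and max() warn on empty input but still return a value
  min  = suppressWarnings(min(numeric(0))),
  max  = suppressWarnings(max(numeric(0)))
)
str(empty_results)

# Because of this, combining a vector with an empty one does not
# change the overall result.
x <- c(1.5, 2, 4)      # illustrative values
empty <- x[FALSE]      # zero-length numeric vector
concat_ok <- sum(c(x, empty)) == sum(x) + sum(empty)
concat_ok
```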

The reason for this is that the return type of as.matrix() for empty (no rows or no columns) data.frame objects is always a matrix of type "logical". The Summary method for data.frame, in turn, throws an error when the data.frame, converted to a matrix, is not of numeric type.
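
The mechanism described above can be inspected with a few lines. This is a sketch, not the code in base R: `df0`, `m0` and `passes_numeric_check` are illustrative names, and the storage type printed by typeof() depends on the R version in use.

```r
# Zero-row data frame with one numeric column.
df0 <- data.frame(x = numeric(0))
m0 <- as.matrix(df0)

dim(m0)       # zero rows, one column
typeof(m0)    # described as "logical" in this thread; may differ elsewhere

# The kind of condition described above: only a numeric matrix is accepted.
passes_numeric_check <- is.numeric(m0)
passes_numeric_check

# Summing the converted matrix directly is unproblematic either way.
sum(m0)
```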

I suggest two ways that make sum, min, max, ... more consistent. IMHO it would be fitting to implement both of these fixes, because they also make other things more consistent.

1. Make the return type of as.matrix() for zero-row data.frames consistent with the type that would have been returned had the data.frame had more than zero rows. "as.matrix(data.frame(x = numeric(0)))" should then be numeric; if there is an empty "character" column, the returned matrix should be of type character, and so on. This would make subsetting by row and conversion to matrix commute (except for row names sometimes):
> all.equal(as.matrix(df[rows, , drop = FALSE]), as.matrix(df)[rows, , drop = FALSE])
Furthermore, this change would make as.matrix.data.frame obey the documentation, which indicates that the coercion hierarchy is used for the return type.
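
To make the commutation property in point 1 concrete, here is a hedged sketch. The data frame `df` and the helper `commutes()` are hypothetical; explicit row names are used so that the row-name caveat mentioned above does not interfere.

```r
# Illustrative data frame with explicit (character) row names.
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30),
                 row.names = c("a", "b", "c"))

# Hypothetical helper: is "subset, then convert" the same as
# "convert, then subset"?
commutes <- function(df, rows) {
  isTRUE(all.equal(as.matrix(df[rows, , drop = FALSE]),
                   as.matrix(df)[rows, , drop = FALSE]))
}

commutes(df, c("a", "c"))    # non-empty selection

# For an empty selection, compare the storage types of the two routes.
# Under proposal 1 they would agree.
rows <- character(0)
type_subset_first  <- typeof(as.matrix(df[rows, , drop = FALSE]))
type_convert_first <- typeof(as.matrix(df)[rows, , drop = FALSE])
c(subset_first = type_subset_first, convert_first = type_convert_first)
```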

2. Make the Summary.data.frame method accept data.frames that produce non-numeric matrices. Apart from the main focus of this message, I believe it would be fitting, for example, to have any() and all() work on logical data.frame objects. The current behaviour is such that
> any(data.frame(x = 1))
#> [1] TRUE
#> Warning message:
#> In any(1, na.rm = FALSE) : coercing argument of type 'double' to logical
and
> any(data.frame(x = TRUE))
#> Error in FUN(X[[i]], ...) :
#>   only defined on a data frame with all numeric variables
So a numeric data.frame warns about implicit coercion, while a logical data.frame (which would not need coercion) does not work at all.
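
Independently of any change to the Summary method, a logical data.frame can be reduced by working on its columns directly. The following is a sketch under that assumption; `df_lgl`, `any_df()` and `all_df()` are hypothetical names, not existing R functions.

```r
df_lgl <- data.frame(x = c(TRUE, FALSE), y = c(FALSE, FALSE))

# Reduce each column first, then combine the per-column results.
any_df <- function(df, na.rm = FALSE) {
  any(vapply(df, any, logical(1), na.rm = na.rm), na.rm = na.rm)
}
all_df <- function(df, na.rm = FALSE) {
  all(vapply(df, all, logical(1), na.rm = na.rm), na.rm = na.rm)
}

any_df(df_lgl)
all_df(df_lgl)

# Zero-row input falls back to the usual empty-vector results.
any_df(df_lgl[FALSE, ])
all_df(df_lgl[FALSE, ])
```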

(I feel more strongly about fixing 1. than 2., because I don't know the discussion that led to the behaviour described in 2.)

Best,
Martin

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: sum() (and similar methods) should work for zero row data.frames

Peter Dalgaard-2
Hmm, yes, this is probably wrong. E.g., we are likely to get inconsistencies out of boundary cases like this

> a <- na.omit(airquality)
> sum(a)
[1] 37495.3
> sum(a[FALSE,])
Error in FUN(X[[i]], ...) :
  only defined on a data frame with all numeric variables

Or, closer to an actual use case:

> sum(subset(a, Ozone>100))
[1] 3330.5
> sum(subset(a, Ozone>200))
Error in FUN(X[[i]], ...) :
  only defined on a data frame with all numeric variables


However, given that numeric summaries generally treat logicals as 0/1, wouldn't it be easiest just to extend the check inside Summary.data.frame with "&& !is.logical(x)"?

> sum(as.matrix(a[FALSE,]))
[1] 0
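
A stand-alone sketch of what such an extended check might look like follows. It is not the code in base R: `summary_df_sketch()` is a hypothetical helper, its error message is invented, and it takes the summary function as an explicit argument instead of dispatching as a group generic.

```r
# Hypothetical helper mimicking a Summary-style method with the
# extended check: logical matrices are accepted in addition to numeric.
summary_df_sketch <- function(fun, ..., na.rm = FALSE) {
  args <- lapply(list(...), function(x) {
    x <- as.matrix(x)
    if (!is.numeric(x) && !is.logical(x))
      stop("only defined on a data frame with all numeric or logical variables")
    x
  })
  do.call(fun, c(args, na.rm = na.rm))
}

a <- data.frame(x = c(1, 2), y = c(3, 4))      # illustrative data
summary_df_sketch(sum, a)                      # 10
summary_df_sketch(sum, a[FALSE, ])             # 0
summary_df_sketch(any, data.frame(x = TRUE))   # TRUE
```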

-pd

> On 17 Oct 2020, at 21:18 , Martin <[hidden email]> wrote:
> [...]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: sum() (and similar methods) should work for zero row data.frames

Gabriel Becker-2
Peter et al,

I had the same thought, in particular for any() and all(), which, inasmuch
as they should work on data.frames in the first place (which, to be
perfectly honest, I do find quite debatable myself), should certainly work
on "logical" data.frames if they are going to work on "numeric" ones.

I can volunteer to prepare a patch if Martin (the reporter) did not want to
take a crack at it, and further if it is not already being done within
R-core.

Best,
~G

On Sun, Oct 18, 2020 at 12:19 AM peter dalgaard <[hidden email]> wrote:

> [...]

Re: sum() (and similar methods) should work for zero row data.frames

Martin-4
From my side: it would be great if you (or R core) could prepare a patch; it would probably take me quite a bit longer than you, since I don't have experience creating patches for R.

Best, Martin

On Sun, Oct 18, 2020, at 21:49, Gabriel Becker wrote:

> [...]

Re: sum() (and similar methods) should work for zero row data.frames

Martin Maechler
>>>>> mb706  
>>>>>     on Sun, 18 Oct 2020 22:14:55 +0200 writes:

    >> From my side: it would be great if you (or R core) could prepare a patch, it would probably take me quite a bit longer than you since I don't have experience creating patches for R.

    > Best, Martin

Basically, just

1.  svn co https://svn.r-project.org/R/trunk  R-devel

2.  inside the R-devel source tree, find  src/library/base/R/dataframe.R
    make the *minimal* changes there,

    (then also add some regression tests and update the help :-)

3.  inside R-devel, do

        svn diff -x -ubw  >  mb706.patch

4.  you've got the patch file  mb706.patch  which you could
    attach to a bug report  on R's bugzilla

    (once you've got an account there ...
     As you've asked for that *and* as you've proven your good
     judgment about "true bug" vs. "not what I expected",
     I'll create such an account for you now, in spite of the
     fact that I'd still like to know a bit more than "Martin
     mb706" about you  ...)

The changes have been committed to R-devel a quarter of an hour ago.
We will keep them in R-devel (currently planned to become R
4.1.0 in spring 2021), and not port to the R-4.0.z branch, as
the change is something like an API change, and also because
nobody had ever reported this as an issue to our knowledge.
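
As an illustration of the kind of regression test mentioned in step 2, here is a hedged sketch. It assumes, based only on the paragraph above, that the change is present from R 4.1.0 on, and skips the checks on older versions; `a` and `checks_ran` are illustrative names.

```r
# Guarded regression checks (assumption: the change is in R >= 4.1.0).
if (getRversion() >= "4.1.0") {
  # zero-row data frames should sum to 0 instead of raising an error
  stopifnot(sum(data.frame(x = numeric(0))) == 0)
  a <- data.frame(x = c(1, 2), y = c(3, 4))
  stopifnot(sum(a[FALSE, ]) == 0)
  # logical data frames should be accepted by any()
  stopifnot(isTRUE(any(data.frame(x = TRUE))))
  checks_ran <- TRUE
} else {
  checks_ran <- FALSE   # older R: behaviour as reported in this thread
}
checks_ran
```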

Thank you, Martin B706 for bringing the issue up,  and Gabe and Peter
for chiming in !!

Best regards,
Martin Maechler
ETH Zurich  and  R core team
   


Re: sum() (and similar methods) should work for zero row data.frames

Hervé Pagès-2
Hi,

There are 2 bugs here. The proposed fix to Summary.data.frame() is fine
but it doesn't address the other problem reported by the OP that
as.matrix() on a zero-row data.frame doesn't respect the type of its
columns, like other column-combining operations do:

   df <- data.frame(a=numeric(0), b=numeric(0))

   typeof(as.matrix(df))
   # [1] "logical"

   typeof(unlist(df))
   # [1] "double"

   typeof(do.call(c, df))
   # [1] "double"

I've run into this myself on a couple of occasions (not in the context
of Summary methods) and worked around it with something like:

   as_matrix_data_frame <- function(df)
   {
     ans <- as.matrix(df)
     if (nrow(df) == 0L)
         storage.mode(ans) <- typeof(unlist(df))
     ans
   }

No reason as.matrix.data.frame() couldn't do something similar.
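
For illustration, this is how the workaround behaves on two zero-row data frames. The function is repeated from above so the example is self-contained; `df_num`, `df_chr`, `m_num` and `m_chr` are made-up names.

```r
# Workaround as posted above, repeated for a self-contained example.
as_matrix_data_frame <- function(df) {
  ans <- as.matrix(df)
  if (nrow(df) == 0L)
    storage.mode(ans) <- typeof(unlist(df))
  ans
}

df_num <- data.frame(a = numeric(0), b = numeric(0))
df_chr <- data.frame(a = numeric(0), s = character(0),
                     stringsAsFactors = FALSE)

m_num <- as_matrix_data_frame(df_num)
m_chr <- as_matrix_data_frame(df_chr)

typeof(m_num)     # storage type follows the numeric columns
typeof(m_chr)     # a character column pulls the whole matrix to character
dim(m_num)        # dimensions are preserved
colnames(m_chr)   # and so are column names
```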

Cheers,
H.


On 10/20/20 09:36, Martin Maechler wrote:

> [...]

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319