Undefined behavior of head() and tail() with n = 0

florent.angly@gmail.com
Hi all,

The documentation for head() and tail() describes the behavior of
these generic functions when n is strictly positive (n > 0) and
strictly negative (n < 0). How these functions work when given a zero
value is not defined.

Both GNU command-line utilities head and tail behave differently with +0 and -0:
http://man7.org/linux/man-pages/man1/head.1.html
http://man7.org/linux/man-pages/man1/tail.1.html

Since R supports signed zeros (1/+0 != 1/-0) and the R head() and
tail() functions are modeled after their GNU counterparts, I would
expect the R functions to distinguish between +0 and -0.

> tail(1:5, n=0)
integer(0)
> tail(1:5, n=1)
[1] 5
> tail(1:5, n=2)
[1] 4 5

> tail(1:5, n=-2)
[1] 3 4 5
> tail(1:5, n=-1)
[1] 2 3 4 5
> tail(1:5, n=-0)
integer(0)  # expected 1:5

> head(1:5, n=0)
integer(0)
> head(1:5, n=1)
[1] 1
> head(1:5, n=2)
[1] 1 2

> head(1:5, n=-2)
[1] 1 2 3
> head(1:5, n=-1)
[1] 1 2 3 4
> head(1:5, n=-0)
integer(0)  # expected 1:5

For both head() and tail(), I expected 1:5 as output but got
integer(0). I obtained similar results using a data.frame and a
function as x argument.
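
The result can be understood from a sign test of the kind sketched below. This is a simplified stand-in, not the actual source of head(); the name head_sketch is hypothetical. It takes the "drop from the end" branch only when n < 0, and -0 < 0 is FALSE, so a negative zero falls into the same branch as a positive one.

```r
# Hypothetical, simplified stand-in for head() on a vector; not R's source.
head_sketch <- function(x, n = 6L) {
  stopifnot(length(n) == 1L)
  # Negative n means "all but the last |n| elements".
  n <- if (n < 0) max(length(x) + n, 0L) else min(n, length(x))
  x[seq_len(n)]
}

-0 < 0                      # FALSE: a negative zero is not "negative"
head_sketch(1:5, n = -2)    # 1 2 3
head_sketch(1:5, n = -0)    # integer(0), same as n = 0
```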

An easy fix would be to explicitly state in the documentation what n =
0 does, and that there is no practical difference between -0 and +0.
However, in my eyes, the better approach would be to implement support
for -0 and document it. What do you think?

Best,

Florent


PS/ My sessionInfo() gives:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Switzerland.1252
LC_CTYPE=German_Switzerland.1252
LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
 LC_TIME=German_Switzerland.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Undefined behavior of head() and tail() with n = 0

Martin Maechler
>>>>> Florent Angly <[hidden email]>
>>>>>     on Wed, 25 Jan 2017 16:31:45 +0100 writes:

    > Hi all,
    > The documentation for head() and tail() describes the behavior of
    > these generic functions when n is strictly positive (n > 0) and
    > strictly negative (n < 0). How these functions work when given a zero
    > value is not defined.

    > Both GNU command-line utilities head and tail behave differently with +0 and -0:
    > http://man7.org/linux/man-pages/man1/head.1.html
    > http://man7.org/linux/man-pages/man1/tail.1.html

    > Since R supports signed zeros (1/+0 != 1/-0)

whoa, whoa, .. slow down --  The above is misleading!

Rather read in  ?Arithmetic (*the* reference to consult for such issues),
where the 2nd part of the following section

 || Implementation limits:
 ||
 ||      [..............]
 ||
 ||      Another potential issue is signed zeroes: on IEC 60559 platforms
 ||      there are two zeroes with internal representations differing by
 ||      sign.  Where possible R treats them as the same, but for example
 ||      direct output from C code often does not do so and may output
 ||      ‘-0.0’ (and on Windows whether it does so or not depends on the
 ||      version of Windows).  One place in R where the difference might be
 ||      seen is in division by zero: ‘1/x’ is ‘Inf’ or ‘-Inf’ depending on
 ||      the sign of zero ‘x’.  Another place is ‘identical(0, -0, num.eq =
 ||      FALSE)’.

says the *contrary* ( __Where possible R treats them as the same__ ):
We do _not_ want to distinguish -0 and +0,
but there are cases where it is unavoidable.

And there are good reasons (mathematics !!) for this.

I'm pretty sure that it would be quite a mistake to start
differentiating it here...  but of course we can continue
discussing here if you like.
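
The places named in the ?Arithmetic excerpt can be checked directly. A minimal sketch using only base R (the variable names pos and neg are chosen here for illustration):

```r
pos <- 0
neg <- -0

pos == neg                              # TRUE: comparison ignores the sign
identical(pos, neg)                     # TRUE: num.eq = TRUE is the default
identical(pos, neg, num.eq = FALSE)     # FALSE: bitwise comparison sees the sign
c(1 / pos, 1 / neg)                     # Inf -Inf: division exposes the sign
```

Everywhere except the last two lines the two zeros are indistinguishable, which is the behaviour described as "where possible R treats them as the same".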

Martin Maechler
ETH Zurich and R Core



RFC: tapply(*, ..., init.value = NA)

Martin Maechler
Last week, we talked here about "xtabs(), factors and NAs",
 ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

In the meantime, I've spent several hours on the issue
and also committed changes to R-devel "in two iterations".

In the case where there is a *left*-hand side part to the xtabs() formula,
see the help page example using 'esoph',
it uses  tapply(...,  FUN = sum)   and
I now think there is a missing feature in tapply() there, which
I am proposing to change.

Look at a small example:

> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
, , N = 3

   L
n   A B C D E F
  1 1 2 0 0 0 0
  2 0 0 1 2 0 0
  3 0 0 0 0 2 2

> DN <- D2; DN[1,"N"] <- NA; DN
   n L  N
2  1 A NA
3  1 B  3
4  1 B  3
6  2 C  3
7  2 D  3
8  2 D  3
9  3 E  3
10 3 E  3
11 3 F  3
12 3 F  3
> with(DN, tapply(N, list(n,L), FUN=sum))
   A  B  C  D  E  F
1 NA  6 NA NA NA NA
2 NA NA  3  6 NA NA
3 NA NA NA NA  6  6
>  

and as you can see, the resulting matrix has NAs, all the same
NA_real_, but semantically of two different kinds:

1) at ["1", "A"], the  NA  comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination
   *and* from the fact that tapply() uses

   array(dim = .., dimnames = ...)

i.e., initializes the array with NAs  (see definition of 'array').
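
The fill behaviour of array() is easy to see in isolation. A minimal sketch (the dimension names are made up for illustration):

```r
# With no 'data' argument, array() fills every cell with NA (logical).
empty <- array(dim = c(2, 3), dimnames = list(c("1", "2"), c("A", "B", "C")))
typeof(empty)        # "logical"
all(is.na(empty))    # TRUE

# Supplying a fill value changes what "no data for this cell" looks like.
zeros <- array(0, dim = c(2, 3), dimnames = dimnames(empty))
typeof(zeros)        # "double"
```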

My proposal is the following patch to tapply(), adding a new
option 'init.value':

-----------------------------------------------------------------------------
 
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
 {
     FUN <- if (!is.null(FUN)) match.fun(FUN)
     if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
     index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
     ans <- lapply(X = ans[index], FUN = FUN, ...)
     if (simplify && all(lengths(ans) == 1L)) {
-       ansmat <- array(dim = extent, dimnames = namelist)
+       ansmat <- array(init.value, dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
     } else {
        ansmat <- array(vector("list", prod(extent)),

-----------------------------------------------------------------------------

With that, I can set the initial value to '0' instead of array's
default of NA :

> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
   A B C D E F
1 NA 6 0 0 0 0
2  0 0 3 6 0 0
3  0 0 0 0 6 6
>

which now has both 0 counts and NA, as is desirable for use inside
xtabs().
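
On an R without the patch, the same table can be obtained after the fact. A minimal sketch, assuming only base R: the cell counts from table() tell which cells had no observations, and only those are set to 0 (the names sums and counts are chosen here for illustration).

```r
# Same data as in the example above.
D2 <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN <- D2
DN[1, "N"] <- NA

sums   <- with(DN, tapply(N, list(n, L), FUN = sum))
counts <- with(DN, table(n, L))   # number of rows per (n, L) cell

# Cells without any observation get 0; the NA that comes from the data stays.
sums[as.vector(counts) == 0] <- 0
sums
```

This needs a second pass over the grouping factors, which is part of what a fill argument inside tapply() would avoid.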

All fine... and would not be worth a posting to R-devel,
except for this:

The change will not be 100% backward compatible -- by necessity: any new argument for
tapply() will make that argument name not available to be
specified (via '...') for 'FUN'.  The new function would be

> str(tapply)
function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)  

where the '...' are passed to FUN(),  and with the new signature,
'init.value' then won't be passed to FUN  "anymore" (compared to
R <= 3.3.x).

For that reason, we could use   'INIT.VALUE' instead (possibly decreasing
the probability the arg name is used in other functions).
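
The compatibility issue can be reproduced with two small stand-in functions. This is a sketch only: apply_old, apply_new and my_sum are hypothetical names, not the real tapply() or any existing summary function.

```r
# Hypothetical summary function that happens to have an 'init.value' argument.
my_sum <- function(x, init.value = 0) init.value + sum(x)

# Stand-ins for the old and new signatures (not the real tapply()).
apply_old <- function(X, FUN, ...) FUN(X, ...)
apply_new <- function(X, FUN, ..., init.value = NA) FUN(X, ...)

apply_old(1:3, my_sum, init.value = 100)   # 106: 'init.value' reaches my_sum()
apply_new(1:3, my_sum, init.value = 100)   # 6: 'init.value' is absorbed by apply_new()
```

The second call runs without error or warning, which is why the choice of a rarely used argument name matters.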


Opinions?

Thank you in advance,
Martin


Re: RFC: tapply(*, ..., init.value = NA)

R devel mailing list
It would be cool if the default for tapply's init.value could be
FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
FUN=all, -Inf for FUN=max, etc.  But that would take time and would
break code for which FUN did not work on length-0 objects.
Bill Dunlap
TIBCO Software
wdunlap tibco.com
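
For illustration, here is what FUN(X[0]) gives for the functions mentioned, plus one case where it fails. A minimal sketch; first_positive is a hypothetical function made up to show a FUN that does not work on length-0 input.

```r
X <- c(2.5, 1, 4)

sum(X[0])                       # 0
length(X[0])                    # 0
all(X[0] > 0)                   # TRUE
suppressWarnings(max(X[0]))     # -Inf (max() also warns about the empty input)

# A hypothetical FUN that does not work on length-0 input:
first_positive <- function(x) if (x[1] > 0) x[1] else NA
tryCatch(first_positive(X[0]), error = function(e) "error")   # "error"
```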



Re: Undefined behavior of head() and tail() with n = 0

R devel mailing list
In reply to this post by Martin Maechler
In addition, signed zeroes only exist for floating point numbers - the
bit patterns for as.integer(0) and as.integer(-0) are identical.
Bill Dunlap
TIBCO Software
wdunlap tibco.com
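
This is easy to confirm at the prompt. A minimal sketch (variable names chosen here for illustration):

```r
neg_dbl <- -0               # double: carries a sign bit
neg_int <- as.integer(-0)   # integer: there is only one zero

1 / neg_dbl                 # -Inf
1 / neg_int                 # Inf: the sign did not survive the conversion
identical(neg_int, 0L)      # TRUE
identical(-0L, 0L)          # TRUE
```

Any distinction between +0 and -0 in head() or tail() would therefore be lost as soon as n is supplied or stored as an integer.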



Re: RFC: tapply(*, ..., init.value = NA)

Henrik Bengtsson-5
In reply to this post by R devel mailing list
On a related note, the storage mode should try to match that of ans[[1]]
(or of the unlist()ed 'ans') when allocating 'ansmat', to avoid coercion and
hence a full copy.

Henrik



Re: RFC: tapply(*, ..., init.value = NA)

Martin Maechler

    > On Jan 26, 2017 07:50, "William Dunlap via R-devel" <[hidden email]>
    > wrote:

    > It would be cool if the default for tapply's init.value could be
    > FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
    > FUN=all, -Inf for FUN=max, etc.  But that would take time and would
    > break code for which FUN did not work on length-0 objects.

    > Bill Dunlap
    > TIBCO Software
    > wdunlap tibco.com

I had the same idea (after my first post), so I agree that would
be nice. One could argue it would take time only if the user is too lazy
to specify the value,  and we could use
   tryCatch(FUN(X[0]), error = function(e) NA)
to safeguard against those functions that fail for 0 length arg.
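
A minimal sketch of such a safeguarded default. The helper name empty_value is hypothetical and not part of any patch; note that the error handler given to tryCatch() has to be a function of the condition.

```r
# Hypothetical helper: value of FUN on a length-0 slice of X, or NA if FUN fails.
empty_value <- function(X, FUN, ...) {
  tryCatch(FUN(X[0], ...), error = function(e) NA)
}

empty_value(c(3, 1, 2), sum)       # 0
empty_value(c(3, 1, 2), length)    # 0
empty_value(c(3, 1, 2), function(x) stop("needs data"))   # NA
```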

But I think the main reason for _not_ setting such a default is
backward compatibility.  In my proposal, the new argument would not
change anything by default, and so all current uses of tapply()
would remain unchanged.

>>>>> Henrik Bengtsson <[hidden email]>
>>>>>     on Thu, 26 Jan 2017 07:57:08 -0800 writes:

    > On a related note, the storage mode should try to match ans[[1]] (or
    > unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full
    > copy.

Yes, related indeed; and would fall "in line" with Bill's idea.
OTOH, it could be implemented independently,
by something like

   if(missing(init.value))
     init.value <-
       if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]]))
       else NA
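
The effect of matching the storage mode can be seen on a small stand-in. A minimal sketch, where the list ans below merely imitates per-cell results of FUN and is not taken from tapply() itself:

```r
ans <- list(1.5, 2.5)    # stand-in for per-cell results of FUN

typeof(array(NA, dim = 2))                                 # "logical"
init <- as.vector(NA, mode = storage.mode(ans[[1]]))
typeof(init)                                               # "double"
typeof(array(init, dim = 2))                               # "double"

# Filling a logical array with doubles forces a conversion of the whole array.
m <- array(NA, dim = 2)
m[1] <- ans[[1]]
typeof(m)                                                  # "double"
```

Starting from an NA of the right type means the later assignment of results does not have to change the type of the array.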

.............

A colleague proposed to use the shorter argument name 'default'
instead of 'init.value'  which indeed may be more natural and
still not too often used as "non-first" argument in  FUN(.).
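
Whether a given R installation offers such an argument can be checked at run time. The sketch below assumes only that the argument, if present, is called 'default' as suggested here; where tapply() lacks it, the code falls back to the plain call. The data x and g are made up for illustration.

```r
x <- c(1, 2, 3)
g <- factor(c("a", "a", "c"), levels = c("a", "b", "c"))   # level "b" is empty

has_default <- "default" %in% names(formals(tapply))

res <- if (has_default) {
  tapply(x, g, FUN = sum, default = 0)
} else {
  tapply(x, g, FUN = sum)          # empty cell stays NA
}
res
```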

Thank you for the constructive feedback!
Martin


Re: Undefined behavior of head() and tail() with n = 0

florent.angly@gmail.com
In reply to this post by R devel mailing list
Martin, I agree with you that +0 and -0 should generally be treated as
equal, and R does a fine job in this respect. The Wikipedia article on
signed zero (https://en.wikipedia.org/wiki/Signed_zero) echoes this
view but also highlights that +0 and -0 can be treated differently in
particular situations, including their interpretation as mathematical
limits (as in the 1/-0 case). Indeed, the main question here is
whether head() and tail() represent a special case that would benefit
from differentiating between +0 and -0.

We can break down the discussion into two problems:
A/ the discrepancy between the implementation of R head() and tail()
and the documentation of these functions (where the use of zero is not
documented and thus not permissible),
B/ the discrepancy between the implementation of R head() and tail()
and their GNU equivalent (which allow zeros and differentiate between
-0 and +0, i.e. head takes "0" and "-0", tail takes "0" and "+0").

There are several possible solutions to address these discrepancies:

1/ Leave the code as-is but document its behavior with respect to zero
(zeros allowed, with negative zeros treated like positive zeros).
Advantages: This is the path of least resistance, and discrepancy A is fixed.
Disadvantages: Discrepancy B remains (but is documented).

2/ Leave the documentation as-is but reflect this in code by not
allowing zeros at all.
Advantages: Discrepancy A is fixed.
Disadvantages: Discrepancy B remains in some form (but is documented).
Need to deprecate the usage of +0 (which was not clearly documented
but may have been assumed by users).

3/ Update the code and documentation to differentiate between +0 and -0.
Advantages: In my eyes, this is the ideal solution since discrepancy A
and (most of) B are resolved.
Disadvantages: It is unclear how to implement this solution and the
implications it may have on backward compatibility:
   a/ Allow -0 (as double). But is it supported on all platforms used
by R (see ?Arithmetic)? William has raised the issue that negative
zero cannot be represented as an integer. Should head() and tail()
then strictly check double input (while forbidding integers)?
   b/ The input could always be given as a character string. This would
allow mirroring GNU tail even more closely (where the prefix "+" is used
to invert the meaning of n). This probably involves a fair amount of work
and careful handling of deprecation.
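
Option 3a could look roughly like the sketch below. Both functions are hypothetical and only illustrate the idea; this is not how head() behaves, and the names is_negative_zero and head_signed are made up here.

```r
# Hypothetical helper: TRUE only for a double zero whose sign bit is set.
is_negative_zero <- function(n) {
  is.double(n) && length(n) == 1L && !is.na(n) && n == 0 && 1 / n < 0
}

# Hypothetical wrapper illustrating option 3a; this is not how head() behaves.
head_signed <- function(x, n = 6L) {
  if (is_negative_zero(n)) x else head(x, n)
}

head_signed(1:5, n = -0)              # 1 2 3 4 5
head_signed(1:5, n = 0)               # integer(0)
head_signed(1:5, n = as.integer(-0))  # integer(0): the integer has no sign
```

The last call shows the difficulty raised for option 3a: the result depends on whether the caller happened to pass a double or an integer zero.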



On 26 January 2017 at 16:51, William Dunlap <[hidden email]> wrote:

> In addition, signed zeroes only exist for floating point numbers - the
> bit patterns for as.integer(0) and as.integer(-0) are identical.
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
> On Thu, Jan 26, 2017 at 1:53 AM, Martin Maechler
> <[hidden email]> wrote:
>>>>>>> Florent Angly <[hidden email]>
>>>>>>>     on Wed, 25 Jan 2017 16:31:45 +0100 writes:
>>
>>     > Hi all,
>>     > The documentation for head() and tail() describes the behavior of
>>     > these generic functions when n is strictly positive (n > 0) and
>>     > strictly negative (n < 0). How these functions work when given a zero
>>     > value is not defined.
>>
>>     > Both GNU command-line utilities head and tail behave differently with +0 and -0:
>>     > http://man7.org/linux/man-pages/man1/head.1.html
>>     > http://man7.org/linux/man-pages/man1/tail.1.html
>>
>>     > Since R supports signed zeros (1/+0 != 1/-0)
>>
>> whoa, whoa, .. slow down --  The above is misleading!
>>
>> Rather read in  ?Arithmetic (*the* reference to consult for such issues),
>> where the 2nd part of the following section
>>
>>  || Implementation limits:
>>  ||
>>  ||      [..............]
>>  ||
>>  ||      Another potential issue is signed zeroes: on IEC 60559 platforms
>>  ||      there are two zeroes with internal representations differing by
>>  ||      sign.  Where possible R treats them as the same, but for example
>>  ||      direct output from C code often does not do so and may output
>>  ||      ‘-0.0’ (and on Windows whether it does so or not depends on the
>>  ||      version of Windows).  One place in R where the difference might be
>>  ||      seen is in division by zero: ‘1/x’ is ‘Inf’ or ‘-Inf’ depending on
>>  ||      the sign of zero ‘x’.  Another place is ‘identical(0, -0, num.eq =
>>  ||      FALSE)’.
>>
>> says the *contrary* ( __Where possible R treats them as the same__ ):
>> We do _not_ want to distinguish -0 and +0,
>> but there are cases where it is unavoidable
>>
>> And there are good reasons (mathematics !!) for this.
>>
>> I'm pretty sure that it would be quite a mistake to start
>> differentiating it here...  but of course we can continue
>> discussing here if you like.
>>
>> Martin Maechler
>> ETH Zurich and R Core
>>
>>
>>     > and the R head() and tail() functions are modeled after
>>     > their GNU counterparts, I would expect the R functions to
>>     > distinguish between +0 and -0
>>
>>     >> tail(1:5, n=0)
>>     > integer(0)
>>     >> tail(1:5, n=1)
>>     > [1] 5
>>     >> tail(1:5, n=2)
>>     > [1] 4 5
>>
>>     >> tail(1:5, n=-2)
>>     > [1] 3 4 5
>>     >> tail(1:5, n=-1)
>>     > [1] 2 3 4 5
>>     >> tail(1:5, n=-0)
>>     > integer(0)  # expected 1:5
>>
>>     >> head(1:5, n=0)
>>     > integer(0)
>>     >> head(1:5, n=1)
>>     > [1] 1
>>     >> head(1:5, n=2)
>>     > [1] 1 2
>>
>>     >> head(1:5, n=-2)
>>     > [1] 1 2 3
>>     >> head(1:5, n=-1)
>>     > [1] 1 2 3 4
>>     >> head(1:5, n=-0)
>>     > integer(0)  # expected 1:5
>>
>>     > For both head() and tail(), I expected 1:5 as output but got
>>     > integer(0). I obtained similar results using a data.frame and a
>>     > function as x argument.
>>
>>     > An easy fix would be to explicitly state in the documentation what n =
>>     > 0 does, and that there is no practical difference between -0 and +0.
>>     > However, in my eyes, the better approach would be implement support
>>     > for -0 and document it. What do you think?
>>
>>     > Best,
>>
>>     > Florent
>>
>>
>>     > PS/ My sessionInfo() gives:
>>     > R version 3.3.2 (2016-10-31)
>>     > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>     > Running under: Windows 7 x64 (build 7601) Service Pack 1
>>
>>     > locale:
>>     > [1] LC_COLLATE=German_Switzerland.1252
>>     > LC_CTYPE=German_Switzerland.1252
>>     > LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
>>     > LC_TIME=German_Switzerland.1252
>>
>>     > attached base packages:
>>     > [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>>     > ______________________________________________
>>     > [hidden email] mailing list
>>     > https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Undefined behavior of head() and tail() with n = 0

Martin Maechler
Dear Florent,

thank you for striving to clearly disentangle and present the
issue below.
That is a nice "role model" way of approaching such topics!

>>>>> Florent Angly <[hidden email]>
>>>>>     on Fri, 27 Jan 2017 10:24:39 +0100 writes:

    > Martin, I agree with you that +0 and -0 should generally be treated as
    > equal, and R does a fine job in this respect. The Wikipedia article on
    > signed zero (https://en.wikipedia.org/wiki/Signed_zero) echoes this
    > view but also highlights that +0 and -0 can be treated differently in
    > particular situations, including their interpretation as mathematical
    > limits (as in the 1/-0 case). Indeed, the main question here is
    > whether head() and tail() represent a special case that would benefit
    > from differentiating between +0 and -0.

    > We can break down the discussion into two problems:
    > A/ the discrepancy between the implementation of R head() and tail()
    > and the documentation of these functions (where the use of zero is not
    > documented and thus not permissible),

Ehm, no, in R (and many other software systems),

  "not documented" does *NOT* entail "not permissible"


    > B/ the discrepancy between the implementation of R head() and tail()
    > and their GNU equivalent (which allow zeros and differentiate between
    > -0 and +0, i.e. head takes "0" and "-0", tail takes "0" and "+0").

This discrepancy, as you mention later, comes from the fact that
these arguments are basically strings in the Unix tools (GNU being a
special case of Unix, here) and integers in R.

Below, I'm giving my personal view of the issue:

    > There are several possible solutions to address these discrepancies:

    > 1/ Leave the code as-is but document its behavior with respect to zero
    > (zeros allowed, with negative zeros treated like positive zeros).
    > Advantages: This is the path of least resistance, and discrepancy A is fixed.
    > Disadvantages: Discrepancy B remains (but is documented).

That would be my "clear" choice.
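
For reference, a short check of the behavior that option 1/ would document. It only restates the results reported at the start of this thread: n = 0 and n = -0 give the same empty result.

```r
x <- 1:5

head(x, n = 0)    # integer(0)
head(x, n = -0)   # integer(0), same as n = 0
tail(x, n = 0)    # integer(0)
tail(x, n = -0)   # integer(0), same as n = 0

# The sign of zero makes no difference to either function.
identical(head(x, n = 0), head(x, n = -0))
identical(tail(x, n = 0), tail(x, n = -0))
```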


    > 2/ Leave the documentation as-is but reflect this in code by not
    > allowing zeros at all.
    > Advantages: Discrepancy A is fixed.
    > Disadvantages: Discrepancy B remains in some form (but is documented).
    > Need to deprecate the usage of +0 (which was not clearly documented
    > but may have been assumed by users).

2/ looks "uniformly inferior" to 1/ to me


    > 3/ Update the code and documentation to differentiate between +0 and -0.
    > Advantages: In my eyes, this is the ideal solution since discrepancy A
    > and (most of) B are resolved.
    > Disadvantages: It is unclear how to implement this solution and the
    > implications it may have on backward compatibility:
    > a/ Allow -0 (as double). But is it supported on all platforms used
    > by R (see ?Arithmetic)? William has raised the issue that negative
    > zero cannot be represented as an integer. Should head() and tail()
    > then strictly check double input (while forbidding integers)?
    > b/ The input could always be as character. This would allow to
    > mirror even more closely GNU tail (where the prefix "+" is used to
    > invert the meaning of n). This probably involves a fair amount of work
    > and careful handling of deprecation.

3/ involves quite a few complications, and in my view, the
   advantages you list do not come close to outweighing the drawbacks.


    > On 26 January 2017 at 16:51, William Dunlap <[hidden email]> wrote:
    >> In addition, signed zeroes only exist for floating point numbers - the
    >> bit patterns for as.integer(0) and as.integer(-0) are identical.

indeed!
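
A small illustration of the point about integers, assuming an IEC 60559 platform as described in ?Arithmetic: the sign of zero is observable for doubles but disappears on conversion to integer.

```r
# Doubles: the two zeros compare equal but can be told apart.
0 == -0                             # TRUE
1 / 0                               # Inf
1 / -0                              # -Inf
identical(0, -0)                    # TRUE  (num.eq = TRUE is the default)
identical(0, -0, num.eq = FALSE)    # FALSE (compares the bit patterns)

# Integers: there is only one zero, so the sign is lost.
identical(as.integer(0), as.integer(-0))   # TRUE
1 / as.integer(-0)                         # Inf, not -Inf
```

So a call such as head(x, n = -0L) could never be distinguished from head(x, n = 0L), whatever the implementation did with doubles.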



Re: RFC: tapply(*, ..., init.value = NA)

Gabor Grothendieck
In reply to this post by Martin Maechler
If xtabs is enhanced then as.data.frame.table may also need to be
modified so that it continues to be usable as an inverse, at least to
the degree feasible.
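
To show what "usable as an inverse" means here, a minimal sketch of the round trip on data without NAs. It uses the D2 example from the quoted message, minus its constant N column (an assumption made to keep the table two-dimensional):

```r
# Example data from the thread, without the constant N column.
D2 <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]))[-c(1, 5), ]

tab  <- xtabs(~ n + L, data = D2)          # contingency table of counts
long <- as.data.frame(tab)                 # dispatches to as.data.frame.table
tab2 <- xtabs(Freq ~ n + L, data = long)   # rebuild the table from n, L, Freq

all(tab == tab2)   # the round trip reproduces every cell
```

If xtabs() starts returning NA in some cells, the long format has to carry those NAs in Freq for the same round trip to keep working.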


On Thu, Jan 26, 2017 at 5:42 AM, Martin Maechler
<[hidden email]> wrote:

> Last week, we've talked here about "xtabs(), factors and NAs",
>  ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
>
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses  tapply(...,  FUN = sum)   and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
>
> Look at a small example:
>
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
> , , N = 3
>
>    L
> n   A B C D E F
>   1 1 2 0 0 0 0
>   2 0 0 1 2 0 0
>   3 0 0 0 0 2 2
>
>> DN <- D2; DN[1,"N"] <- NA; DN
>    n L  N
> 2  1 A NA
> 3  1 B  3
> 4  1 B  3
> 6  2 C  3
> 7  2 D  3
> 8  2 D  3
> 9  3 E  3
> 10 3 E  3
> 11 3 F  3
> 12 3 F  3
>> with(DN, tapply(N, list(n,L), FUN=sum))
>    A  B  C  D  E  F
> 1 NA  6 NA NA NA NA
> 2 NA NA  3  6 NA NA
> 3 NA NA NA NA  6  6
>>
>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
>
> 1) at ["1", "A"], the  NA  comes from the NA in 'N'
> 2) all other NAs come from the fact that there is no such factor combination
>    *and* from the fact that tapply() uses
>
>    array(dim = .., dimnames = ...)
>
> i.e., initializes the array with NAs  (see definition of 'array').
>
> My proposition is the following patch to  tapply(), adding a new
> option 'init.value':
>
> -----------------------------------------------------------------------------
>
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>  {
>      FUN <- if (!is.null(FUN)) match.fun(FUN)
>      if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
>      ans <- lapply(X = ans[index], FUN = FUN, ...)
>      if (simplify && all(lengths(ans) == 1L)) {
> -       ansmat <- array(dim = extent, dimnames = namelist)
> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>         ans <- unlist(ans, recursive = FALSE)
>      } else {
>         ansmat <- array(vector("list", prod(extent)),
>
> -----------------------------------------------------------------------------
>
> With that, I can set the initial value to '0' instead of array's
> default of NA :
>
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>    A B C D E F
> 1 NA 6 0 0 0 0
> 2  0 0 3 6 0 0
> 3  0 0 0 0 6 6
>>
>
> which now has 0 counts and NA  as is desirable to be used inside
> xtabs().
>
> All fine... and would not be worth a posting to R-devel,
> except for this:
>
> The change will not be 100% back compatible -- by necessity: any new argument for
> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'.  The new function would be
>
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>
> where the '...' are passed FUN(),  and with the new signature,
> 'init.value' then won't be passed to FUN  "anymore" (compared to
> R <= 3.3.x).
>
> For that reason, we could use   'INIT.VALUE' instead (possibly decreasing
> the probability the arg name is used in other functions).
>
>
> Opinions?
>
> Thank you in advance,
> Martin
>



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: RFC: tapply(*, ..., init.value = NA)

R devel mailing list
In reply to this post by Martin Maechler
The "no factor combination" case is distinguishable by 'tapply' with simplify=FALSE.

> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)
> D2 <- D2[-c(1,5), ]
> DN <- D2; DN[1,"N"] <- NA
> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))
  A    B    C    D    E    F
1 NA   6    NULL NULL NULL NULL
2 NULL NULL 3    6    NULL NULL
3 NULL NULL NULL NULL 6    6


There is an old related discussion starting on https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html .


Re: RFC: tapply(*, ..., init.value = NA)

Martin Maechler
>>>>> Suharto Anggono Suharto Anggono via R-devel <[hidden email]>
>>>>>     on Fri, 27 Jan 2017 16:36:59 +0000 writes:

    > The "no factor combination" case is distinguishable by 'tapply' with simplify=FALSE.
    >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)
    >> D2 <- D2[-c(1,5), ]
    >> DN <- D2; DN[1,"N"] <- NA
    >> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))
    > A    B    C    D    E    F
    > 1 NA   6    NULL NULL NULL NULL
    > 2 NULL NULL 3    6    NULL NULL
    > 3 NULL NULL NULL NULL 6    6

Yes, I know that simplify=FALSE behaves differently: it returns
a list with dim & dimnames, sometimes also called a "list - matrix".
It *can* be used instead, but to be useful it needs to be
post-processed, and that overall is somewhat inefficient and ugly.
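
A minimal sketch of such post-processing, assuming the empty cells should become 0. The names lst, empty and res are only illustrative:

```r
D2 <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN <- D2; DN[1, "N"] <- NA

# List-matrix: NULL where a factor combination does not occur.
lst <- with(DN, tapply(N, list(n, L), FUN = sum, simplify = FALSE))

empty <- vapply(lst, is.null, logical(1))      # TRUE for combinations with no data
res <- array(0, dim = dim(lst), dimnames = dimnames(lst))
res[!empty] <- unlist(lst[!empty])             # fill in the computed sums
res
```

The NA at ["1", "A"] is kept because it was computed by sum(), while the twelve cells with no factor combination hold 0. This takes an extra pass over all cells, which is the inefficiency referred to above.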


    > There is an old related discussion starting on https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html .

Thank you, indeed, for finding that. There Andrew Robinson did
raise the same issue, but his proposed solution was not very
backward compatible and I think was primarily dismissed because of that.

Martin


Re: RFC: tapply(*, ..., init.value = NA)

Henrik Bengtsson-5
In reply to this post by Martin Maechler
On Fri, Jan 27, 2017 at 12:34 AM, Martin Maechler
<[hidden email]> wrote:

>
>     > On Jan 26, 2017 07:50, "William Dunlap via R-devel" <[hidden email]>
>     > wrote:
>
>     > It would be cool if the default for tapply's init.value could be
>     > FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
>     > FUN=all, -Inf for FUN=max, etc.  But that would take time and would
>     > break code for which FUN did not work on length-0 objects.
>
>     > Bill Dunlap
>     > TIBCO Software
>     > wdunlap tibco.com
>
> I had the same idea (after my first post), so I agree that would
> be nice. One could argue it would take time only if the user is too lazy
> to specify the value,  and we could use
>    tryCatch(FUN(X[0]), error = function(e) NA)
> to safeguard against those functions that fail for 0 length arg.
>
> But I think the main reason for _not_ setting such a default is
> back-compatibility.  In my proposal, the new argument would not
> be any change by default and so all current uses of tapply()
> would remain unchanged.
>
>>>>>> Henrik Bengtsson <[hidden email]>
>>>>>>     on Thu, 26 Jan 2017 07:57:08 -0800 writes:
>
>     > On a related note, the storage mode should try to match ans[[1]] (or
>     > unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full
>     > copy.
>
> Yes, related indeed; and would fall "in line" with Bill's idea.
> OTOH, it could be implemented independently,
> by something like
>
>    if(missing(init.value))
>      init.value <-
>        if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]]))
>        else NA

I would probably do something like:

  ans <- unlist(ans, recursive = FALSE, use.names = FALSE)
  if (length(ans)) storage.mode(init.value) <- storage.mode(ans[[1]])
  ansmat <- array(init.value, dim = extent, dimnames = namelist)

instead.  That completely avoids having to use missing() and the value
of 'init.value' will be coerced later if not done upfront.  use.names
= FALSE speeds up unlist().
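
As an illustration of the effect, here is a small stand-alone sketch. The list ans below is a made-up stand-in for the per-cell results inside tapply(), and the 2 x 3 extent is arbitrary:

```r
ans <- list(6L, 3L, 6L)   # stand-in for integer results of FUN
init.value <- NA          # a logical NA, as in the proposed default

ans <- unlist(ans, recursive = FALSE, use.names = FALSE)
if (length(ans)) storage.mode(init.value) <- storage.mode(ans[[1]])
ansmat <- array(init.value, dim = c(2L, 3L))

typeof(array(NA, dim = c(2L, 3L)))   # "logical": would be coerced on assignment
typeof(ansmat)                       # "integer": already matches the results
```

With the matching storage mode, assigning the results into ansmat does not need to convert the whole array first.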

/Henrik

>
> .............
>
> A colleague proposed to use the shorter argument name 'default'
> instead of 'init.value', which indeed may be more natural and
> still not too often used as "non-first" argument in  FUN(.).
>
> Thank you for the constructive feedback!
> Martin
>
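
For readers trying this today: tapply() in R 3.4.0 and later has a 'default' argument, so the example from the proposal can be run as below. This sketch assumes such an R version; it fails with an unused-argument error from sum() on R <= 3.3.x.

```r
D2 <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN <- D2; DN[1, "N"] <- NA

# Cells with no factor combination get 0; the NA computed from the data is kept.
tab0 <- with(DN, tapply(N, list(n, L), FUN = sum, default = 0))
tab0
```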
>     > On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler
>     > <[hidden email]> wrote:
>     >> Last week, we've talked here about "xtabs(), factors and NAs",
>     -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>     >>
>     >> In the mean time, I've spent several hours on the issue
>     >> and also committed changes to R-devel "in two iterations".
>     >>
>     >> In the case there is a *Left* hand side part to xtabs() formula,
>     >> see the help page example using 'esoph',
>     >> it uses  tapply(...,  FUN = sum)   and
>     >> I now think there is a missing feature in tapply() there, which
>     >> I am proposing to change.
>     >>
>     >> Look at a small example:
>     >>
>     >>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
>     > N=3)[-c(1,5), ]; xtabs(~., D2)
>     >> , , N = 3
>     >>
>     >> L
>     >> n   A B C D E F
>     >> 1 1 2 0 0 0 0
>     >> 2 0 0 1 2 0 0
>     >> 3 0 0 0 0 2 2
>     >>
>     >>> DN <- D2; DN[1,"N"] <- NA; DN
>     >> n L  N
>     >> 2  1 A NA
>     >> 3  1 B  3
>     >> 4  1 B  3
>     >> 6  2 C  3
>     >> 7  2 D  3
>     >> 8  2 D  3
>     >> 9  3 E  3
>     >> 10 3 E  3
>     >> 11 3 F  3
>     >> 12 3 F  3
>     >>> with(DN, tapply(N, list(n,L), FUN=sum))
>     >> A  B  C  D  E  F
>     >> 1 NA  6 NA NA NA NA
>     >> 2 NA NA  3  6 NA NA
>     >> 3 NA NA NA NA  6  6
>     >>>
>     >>
>     >> and as you can see, the resulting matrix has NAs, all the same
>     >> NA_real_, but semantically of two different kinds:
>     >>
>     >> 1) at ["1", "A"], the  NA  comes from the NA in 'N'
>     >> 2) all other NAs come from the fact that there is no such factor
>     > combination
>     >> *and* from the fact that tapply() uses
>     >>
>     >> array(dim = .., dimnames = ...)
>     >>
>     >> i.e., initializes the array with NAs  (see definition of 'array').
>     >>
>     >> My proposition is the following patch to  tapply(), adding a new
>     >> option 'init.value':
>     >>
>     >> ------------------------------------------------------------
>     > -----------------
>     >>
>     >> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
>     >> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify
>     > = TRUE)
>     >> {
>     >> FUN <- if (!is.null(FUN)) match.fun(FUN)
>     >> if (!is.list(INDEX)) INDEX <- list(INDEX)
>     >> @@ -44,7 +44,7 @@
>     >> index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
>     >> ans <- lapply(X = ans[index], FUN = FUN, ...)
>     >> if (simplify && all(lengths(ans) == 1L)) {
>     >> -       ansmat <- array(dim = extent, dimnames = namelist)
>     >> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>     >> ans <- unlist(ans, recursive = FALSE)
>     >> } else {
>     >> ansmat <- array(vector("list", prod(extent)),
>     >>
>     >> ------------------------------------------------------------
>     > -----------------
>     >>
>     >> With that, I can set the initial value to '0' instead of array's
>     >> default of NA :
>     >>
>     >>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>     >> A B C D E F
>     >> 1 NA 6 0 0 0 0
>     >> 2  0 0 3 6 0 0
>     >> 3  0 0 0 0 6 6
>     >>>
>     >>
>     >> which now has 0 counts and NA  as is desirable to be used inside
>     >> xtabs().
>     >>
>     >> All fine... and would not be worth a posting to R-devel,
>     >> except for this:
>     >>
>     >> The change will not be 100% back compatible -- by necessity: any new
>     > argument for
>     >> tapply() will make that argument name not available to be
>     >> specified (via '...') for 'FUN'.  The new function would be
>     >>
>     >>> str(tapply)
>     >> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>     >>
>     >> where the '...' are passed FUN(),  and with the new signature,
>     >> 'init.value' then won't be passed to FUN  "anymore" (compared to
>     >> R <= 3.3.x).
>     >>
>     >> For that reason, we could use   'INIT.VALUE' instead (possibly decreasing
>     >> the probability the arg name is used in other functions).
>     >>
>     >>
>     >> Opinions?
>     >>
>     >> Thank you in advance,
>     >> Martin
>     >>
>     >> ______________________________________________
>     >> [hidden email] mailing list
>     >> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>     > ______________________________________________
>     > [hidden email] mailing list
>     > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>     > [[alternative HTML version deleted]]
>
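
The back-compatibility concern in the quoted RFC can be illustrated with a small sketch. The functions old_apply(), new_apply() and add_init() are hypothetical stand-ins, not the tapply() source; they only show how a new formal argument stops a same-named argument from reaching FUN through '...'.

```r
# Hypothetical stand-ins for tapply() before and after gaining a new
# formal argument; only the argument matching is of interest here.
old_apply <- function(X, FUN, ...) FUN(X, ...)
new_apply <- function(X, FUN, ..., init.value = NA) FUN(X, ...)

# A user function that happens to have an argument named 'init.value'.
add_init <- function(x, init.value = 0) sum(x) + init.value

# Old signature: 'init.value' travels through '...' to add_init().
r_old <- old_apply(1:3, add_init, init.value = 10)
# New signature: 'init.value' is matched by new_apply() itself and
# add_init() falls back to its own default of 0.
r_new <- new_apply(1:3, add_init, init.value = 10)

print(c(old = r_old, new = r_new))
```

An upper-case name such as 'INIT.VALUE' would make this kind of clash less likely, but could not rule it out.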

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: RFC: tapply(*, ..., init.value = NA)

Martin Maechler
>>>>> Henrik Bengtsson <[hidden email]>
>>>>>     on Fri, 27 Jan 2017 09:46:15 -0800 writes:

    > On Fri, Jan 27, 2017 at 12:34 AM, Martin Maechler
    > <[hidden email]> wrote:
    >>
    >> > On Jan 26, 2017 07:50, "William Dunlap via R-devel"
    >> <[hidden email]> > wrote:
    >>
    >> > It would be cool if the default for tapply's init.value
    >> could be > FUN(X[0]), so it would be 0 for FUN=sum or
    >> FUN=length, TRUE for > FUN=all, -Inf for FUN=max, etc.
    >> But that would take time and would > break code for which
    >> FUN did not work on length-0 objects.
    >>
    >> > Bill Dunlap > TIBCO Software > wdunlap tibco.com
    >>
    >> I had the same idea (after my first post), so I agree
    >> that would be nice. One could argue it would take time
    >> only if the user is too lazy to specify the value, and we
    >> could use tryCatch(FUN(X[0]), error = function(e) NA) to
    >> safeguard against those functions that fail for 0 length arg.
    >>
    >> But I think the main reason for _not_ setting such a
    >> default is back-compatibility.  In my proposal, the new
    >> argument would not be any change by default and so all
    >> current uses of tapply() would remain unchanged.
    >>
    >>>>>>> Henrik Bengtsson <[hidden email]> on
    >>>>>>> Thu, 26 Jan 2017 07:57:08 -0800 writes:
    >>
    >> > On a related note, the storage mode should try to match
    >> ans[[1]] (or > unlist:ed and) when allocating 'ansmat' to
    >> avoid coercion and hence a full > copy.
    >>
    >> Yes, related indeed; and would fall "in line" with Bill's
    >> idea.  OTOH, it could be implemented independently, by
    >> something like
    >>
    >> if(missing(init.value)) init.value <- if(length(ans))
    >> as.vector(NA, mode=storage.mode(ans[[1]])) else NA

> I would probably do something like:

>   ans <- unlist(ans, recursive = FALSE, use.names = FALSE)
>   if (length(ans)) storage.mode(init.value) <- storage.mode(ans[[1]])
>   ansmat <- array(init.value, dim = extent, dimnames = namelist)

> instead.  That completely avoids having to use missing() and the value
> of 'init.value' will be coerced later if not done upfront.  use.names
> = FALSE speeds up unlist().

Thank you, Henrik.
That's a good idea to do the unlist() first, and with 'use.names=FALSE'.
I'll copy that.
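
A minimal sketch of that ordering, outside of tapply(): 'ans' here is a made-up stand-in for the per-cell results, and the fill value is coerced up front as Henrik suggests (which is exactly the step I question below when the user supplied the value).

```r
# Stand-in for the per-cell results that lapply() produces inside tapply().
ans <- list(a = 1L, b = 2L, c = 3L)

# Flatten first; use.names = FALSE skips building the names vector.
flat <- unlist(ans, recursive = FALSE, use.names = FALSE)

# Coerce the fill value to the storage mode of the results up front.
init.value <- NA
if (length(flat)) storage.mode(init.value) <- storage.mode(flat)

# The array is allocated with the right type, so the assignment below
# does not need to change the array's type.
ansmat <- array(init.value, dim = 4L)
ansmat[1:3] <- flat
print(typeof(ansmat))
```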

On the other hand, "brutally" modifying  'init.value' (now called 'default')
even when the user has specified it is not acceptable I think.
You are right that it would be coerced anyway subsequently, but
the coercion will happen in whatever method of  `[<-` will be
appropriate.
Good S3 and S4 programmers will write such methods for their classes.

For that reason, I'm even more conservative now: I only fiddle in
the case of an atomic 'ans', and make use of the corresponding '['
method rather than as.vector(.), because that will fulfill
the following new regression test {not fulfilled in current R}:

identical(tapply(1:3, 1:3, as.raw),
          array(as.raw(1:3), 3L, dimnames=list(1:3)))

Also, I've done a few more things -- treating if(.) . else . as a
function call, etc  and now committed as  rev 72040  to
R-devel... really wanting to get this out.

We can bet on whether there will be ripples in (visible) package space;
I give it a relatively high chance of no ripples (and a much higher
chance of problems with the more aggressive proposal..)
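
For comparison, the more aggressive proposal (taking FUN(X[0]) as the fill value) could be sketched roughly as below. The helper empty_cell_value() is hypothetical and not part of R; note that the 'error' argument of tryCatch() has to be a handler function.

```r
# Hypothetical helper (not part of R): derive a fill value for empty cells
# by calling FUN on a zero-length slice of X, falling back to NA on error.
empty_cell_value <- function(X, FUN, ...) {
  FUN <- match.fun(FUN)
  tryCatch(FUN(X[0L], ...), error = function(e) NA)
}

v_sum  <- empty_cell_value(1:5, sum)
v_len  <- empty_cell_value(1:5, length)
v_all  <- empty_cell_value(c(TRUE, FALSE), all)
v_max  <- suppressWarnings(empty_cell_value(1:5, max))  # max() warns on empty input
v_fail <- empty_cell_value(1:5, function(x) stop("needs at least one value"))

str(list(sum = v_sum, length = v_len, all = v_all, max = v_max, failing = v_fail))
```

The extra call to FUN and the changed results for existing code are the reasons this is the riskier variant.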

Thank you again, for your "thinking along" and constructive
suggestions.

Martin


Re: RFC: tapply(*, ..., init.value = NA)

R devel mailing list
In reply to this post by Martin Maechler
Function 'aggregate.data.frame' in R has taken a different route. With drop=FALSE, the function is also applied to the subsets corresponding to combinations of grouping variables that don't appear in the data (example 2 in https://stat.ethz.ch/pipermail/r-devel/2017-January/073678.html).

Because 'default' is used only when simplification happens, putting 'default' after 'simplify' in the argument list may be more logical. In any case, it doesn't affect calls to 'tapply', because the argument 'default' must be specified by name.

With the code using
if(missing(default)) ,
I consider the stated default value of 'default',
default = NA ,
misleading because the code doesn't use it. Also,
tapply(1:3, 1:3, as.raw)
is not the same as
tapply(1:3, 1:3, as.raw, default = NA) .
The accurate statement is the code in
if(missing(default)) ,
but it involves the local variable 'ans'.

As far as I know, the result of function 'array' is not a classed object, so the default method of  `[<-` will be used in that portion of the 'tapply' code.

As far as I know, the result of 'lapply' is a list without class. So, 'unlist' applied to it uses the default method and the 'unlist' result is a vector or a factor.

With the change, the result of
tapply(1:3, 1:3, factor, levels=3:1)
is of mode "character". The values come from the internal codes, not from the factor levels. This is worse than before the change, when the result was at least the internal codes as integers.
In the documentation, the description of argument 'simplify' says: "If 'TRUE' (the default), then if 'FUN' always returns a scalar, 'tapply' returns an array with the mode of the scalar."

A zero-length vector can also be used to initialize the array.

For 'xtabs', I think that it is better if the result has storage mode "integer" if 'sum' results are of storage mode "integer", as in R 3.3.2. As 'default' argument for 'tapply', 'xtabs' can use 0L, or use 0L or 0 depending on storage mode of the summed quantity.
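
A small sketch of the storage mode point, assuming a version of R in which 'tapply' has the 'default' argument discussed in this thread. The data frame 'd' is made up for illustration; the combination (b, y) does not occur in it.

```r
# Hypothetical data: the combination (b, y) does not occur.
d <- data.frame(g1 = c("a", "a", "b"),
                g2 = c("x", "y", "x"),
                v  = c(1L, NA, 3L))

# Integer fill value: the result keeps storage mode "integer".
t_int <- tapply(d$v, list(d$g1, d$g2), sum, default = 0L)
# Double fill value: the integer sums are stored in a double array.
t_dbl <- tapply(d$v, list(d$g1, d$g2), sum, default = 0)

print(t_int)
print(c(typeof(t_int), typeof(t_dbl)))
```

The empty cell gets the fill value, while the cell whose data contain NA stays NA.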



Re: RFC: tapply(*, ..., init.value = NA)

Martin Maechler
>>>>> Suharto Anggono Suharto Anggono via R-devel <[hidden email]>
>>>>>     on Tue, 31 Jan 2017 15:43:53 +0000 writes:

    > Function 'aggregate.data.frame' in R has taken a different route. With drop=FALSE, the function is also applied to subset corresponding to combination of grouping variables that doesn't appear in the data (example 2 in https://stat.ethz.ch/pipermail/r-devel/2017-January/073678.html).

Interesting point (I couldn't easily find 'the example 2' though).
However, aggregate.data.frame() is a considerably more
sophisticated function and one goal was to change tapply() as
little as possible for compatibility (and maintenance!) reasons .

> Because 'default' is used only when simplification happens, putting 'default' after 'simplify' in the argument list may be more logical.

Yes, from this point of view, you are right; I had thought about
that too; on the other hand, it belongs "closely" to the 'FUN'
and I think that's why I had decided not to change the proposal..

> Anyway, it doesn't affect call to 'tapply' because the argument 'default' must be specified by name.

Exactly.. so we keep the order as is.

    > With the code using
    >    if(missing(default)) ,
    > I consider the stated default value of 'default',
    >    default = NA ,
    > misleading because the code doesn't use it.

I know and I also had thought about it and decided to keep it
in the spirit of "self documentation" because  "in spirit", the
default still *is* NA.

    > Also,
    >  tapply(1:3, 1:3, as.raw)
    > is not the same as
    >  tapply(1:3, 1:3, as.raw, default = NA) .
    > The accurate statement is the code in
    > if(missing(default)) ,
    > but it involves the local variable 'ans'.

exactly.  But putting that whole expression in there would look
confusing to those using  str(tapply), args(tapply) or similar
inspection to quickly get a glimpse of the function user "interface".
That's why we typically don't do that and rather slightly cheat
with the formal default, for the above "didactical" purposes.

If you are a purist about this, then missing() should almost never
be used when the function argument has a formal default.

I don't have too strong an opinion here, and we do have quite a
few other cases where the formal default argument is not always
used because of   if(missing(.))  clauses.

I think I could be convinced to drop the '= NA' from the formal
argument list..
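
To make the point concrete, here is a toy version of the pattern. The function fill_value() is hypothetical and not the tapply() source; it only shows how omitting the argument and passing the formal default explicitly can differ.

```r
# Hypothetical function, not the tapply() source: the formal default says NA,
# but a missing() clause substitutes something else when 'default' is omitted.
fill_value <- function(ans, default = NA) {
  if (missing(default)) default <- ans[0L][1L]
  default
}

omitted  <- fill_value(as.raw(1:3))                # raw 00, since raw has no NA
explicit <- fill_value(as.raw(1:3), default = NA)  # logical NA, as passed

print(identical(omitted, explicit))
```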


    > As far as I know, the result of function 'array' is not a classed object and the default method of  `[<-` will be used in the 'tapply' code portion.

    > As far as I know, the result of 'lapply' is a list without class. So, 'unlist' applied to it uses the default method and the 'unlist' result is a vector or a factor.

You may be right here
  ((or not:  If a package author makes array() into an S3 generic and defines
    S3method(array, *) and she or another makes tapply() into a
    generic with methods,  are we really sure that this code
    would not be used ??))

still, the as.raw example did not easily work without a warning
when using as.vector() .. or similar.

    > With the change, the result of

    > tapply(1:3, 1:3, factor, levels=3:1)

    > is of mode "character". The value is from the internal code, not from the factor levels. It is worse than before the change, where it is really the internal code, integer.

I agree that this change is not desirable.
One could argue that it was quite a "lucky coincidence" that the previous
code returned the internal integer codes though..
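
A sketch of where the "character" comes from, as far as I can reconstruct it outside of tapply(): the unlist()ed factor counts as atomic, and array() then goes through as.vector(), which turns a factor into character.

```r
# Per-cell results as in tapply(1:3, 1:3, factor, levels = 3:1).
ans <- unlist(lapply(1:3, factor, levels = 3:1), use.names = FALSE)

print(is.factor(ans))   # unlist() of factors gives a factor
print(is.atomic(ans))   # ... which also counts as atomic

# array() calls as.vector() on classed data, which turns a factor into character.
ansmat <- array(ans[0L][1L], dim = 3L)
print(typeof(ansmat))
```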


    > In the documentation, the description of argument 'simplify' says: "If 'TRUE' (the default), then if 'FUN' always returns a scalar, 'tapply' returns an array with the mode of the scalar."


    > To initialize array, a zero-length vector can also be used.

yes, of course; but my  ans[0L][1L]  had the purpose to get the
correct mode specific version of NA .. which works for raw (by
getting '00' because "raw" has *no* NA!).
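
For illustration, the idiom applied to a few atomic types; typed_na() is just a hypothetical name for it.

```r
# Hypothetical helper name; the idiom itself is ans[0L][1L]:
# take a zero-length vector of the same type, then index past its end.
typed_na <- function(ans) ans[0L][1L]

na_int <- typed_na(1:3)
na_dbl <- typed_na(c(1.5, 2))
na_chr <- typed_na(letters)
na_raw <- typed_na(as.raw(1:3))   # raw has no NA, so this is 00

str(list(na_int, na_dbl, na_chr, na_raw))
```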

So it seems I need an additional   !is.factor(ans)  there ...
a bit ugly.


---------

> For 'xtabs', I think that it is better if the result has storage mode "integer" if 'sum' results are of storage mode "integer", as in R 3.3.2.

you are right, that *is* preferable

>  As 'default' argument for 'tapply', 'xtabs' can use 0L, or use 0L or 0 depending on storage mode of the summed quantity.

indeed, that will be an improvement there!


Re: RFC: tapply(*, ..., init.value = NA)

R devel mailing list
In reply to this post by Martin Maechler
On 'aggregate.data.frame', the URL should be https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html .

vector(typeof(ans))
(or  vector(storage.mode(ans)))
has length zero and can be used to initialize array.

Instead of
if(missing(default)) ,
if(identical(default, NA))
could be used. The documentation could then say, for example: "If default = NA (the default), NA of appropriate storage mode (0 for raw) is automatically used."
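
A short sketch of both suggestions, with made-up 'ans' values. It relies on array() filling with NA of the appropriate type (0 for raw) when 'data' has length zero.

```r
ans_int <- 1:3
ans_raw <- as.raw(1:3)

# A zero-length vector of the right type; array() fills with the typed NA
# (or 0 for raw) when 'data' has length zero.
a_int <- array(vector(typeof(ans_int)), dim = 3L)
a_raw <- array(vector(typeof(ans_raw)), dim = 3L)
print(a_int)
print(a_raw)

# identical() is stricter than is.na(): only a length-one logical NA matches.
checks <- c(identical(NA, NA), identical(NA_integer_, NA), identical(NaN, NA))
print(checks)
```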


Re: RFC: tapply(*, ..., init.value = NA)

Martin Maechler
>>>>> Suharto Anggono Suharto Anggono via R-devel <[hidden email]>
>>>>>     on Wed, 1 Feb 2017 16:17:06 +0000 writes:

    > On 'aggregate data.frame', the URL should be
    > https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html .

thank you. Yes, using 'drop' makes sense there where the result
is always "linear(ized)" or "one-dimensional".
For tapply() that's only the case for 1D-index.

    > vector(typeof(ans)) (or vector(storage.mode(ans))) has
    > length zero and can be used to initialize array.  

Yes, except in the case where ans is NULL.
You have convinced me; that is nicer.

    > Instead of if(missing(default)) , if(identical(default,
    > NA)) could be used. The documentation could then say, for
    > example: "If default = NA (the default), NA of appropriate
    > storage mode (0 for raw) is automatically used."

After some thought (and experiments), I have reverted and no
longer use if(missing). You are right that it is not needed
(and even potentially confusing) here.

Changes are in svn c72106.

Martin Maechler




Re: RFC: tapply(*, ..., init.value = NA)

R devel mailing list
In reply to this post by Martin Maechler
Function 'tapply' in R devel r72137 uses
if(!is.null(ans) && is.na(default) && is.atomic(ans)) .

Problems:
- It is possible that a user-specified 'default' is not of length 1. If the length is zero, the 'if' gives an error.
- It is possible that is.na(default) is TRUE when a user-specified 'default' is NaN.
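
Both cases can be reproduced outside of 'tapply' with the condition alone. The wrapper uses_typed_na() is hypothetical and only reports which branch is taken, or that an error occurred; the 'ans' and 'default' values are made up.

```r
# The condition from r72137, wrapped so that an error is reported as a string.
# 'ans' and 'default' are illustrative values, not taken from real code.
uses_typed_na <- function(ans, default) {
  tryCatch(
    if (!is.null(ans) && is.na(default) && is.atomic(ans)) "typed NA" else "default as given",
    error = function(e) "error"
  )
}

r_empty <- uses_typed_na(1:3, logical(0))  # zero-length 'default'
r_nan   <- uses_typed_na(1:3, NaN)         # is.na(NaN) is TRUE
r_na    <- uses_typed_na(1:3, NA)

print(c(empty = r_empty, NaN = r_nan, na = r_na))
```

A check such as identical(default, NA) would treat NaN and zero-length values as user-specified instead.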

