Corrupt internal row names when creating a data.frame with `attributes<-`

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Corrupt internal row names when creating a data.frame with `attributes<-`

Davis Vaughan
Hi all,

I believe that the internal row names object created at this line in
`row_names_gets()` should be using `-n`, not `n`.
https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71

This can currently generate corrupt internal row names when using
`attributes<-` or `structure()`, which calls `attributes<-`.

# internal row names are typically `c(NA, -n)`
df <- data.frame(x = 1:3)
.row_names_info(df, type = 0L)
#> [1] NA -3

# using `attributes()` materializes their non-internal form
attrs <- attributes(df)
attrs
#> $names
#> [1] "x"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3

# let's make a data frame from scratch with `attributes<-`
data <- list(x = 1:3)
attributes(data) <- attrs

# oh no!
.row_names_info(data, type = 0L)
#> [1] NA  3

# Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
# internal row names are not attempted to be created in the C level
# `row_names_gets()`

Thanks,
Davis

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Corrupt internal row names when creating a data.frame with `attributes<-`

Kevin Ushey
Strictly speaking, I don't think this is a "corrupt" representation,
given that any APIs used to access that internal representation will
call abs() on the row count encoded within. At least, as far as I can
tell, there aren't any adverse downstream effects from having the row
names attribute encoded with this particular internal representation.

On the other hand, the documentation in ?.row_names_info states, for
the 'type' argument:

integer. Currently type = 0 returns the internal "row.names" attribute
(possibly NULL), type = 2 the number of rows implied by the attribute,
and type = 1 the latter with a negative sign for ‘automatic’ row
names.

so one could argue that it's incorrect in light of that documentation
(the row names are "automatic", but the row count is not marked with a
negative sign). Or perhaps this is a different "type" of internal
automatic row name, since it was generated from an already-existing
integer sequence rather than "automatically" in a call to
data.frame().

Kevin

On Sun, Feb 14, 2021 at 6:51 AM Davis Vaughan <[hidden email]> wrote:

>
> Hi all,
>
> I believe that the internal row names object created at this line in
> `row_names_gets()` should be using `-n`, not `n`.
> https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71
>
> This can currently generate corrupt internal row names when using
> `attributes<-` or `structure()`, which calls `attributes<-`.
>
> # internal row names are typically `c(NA, -n)`
> df <- data.frame(x = 1:3)
> .row_names_info(df, type = 0L)
> #> [1] NA -3
>
> # using `attributes()` materializes their non-internal form
> attrs <- attributes(df)
> attrs
> #> $names
> #> [1] "x"
> #>
> #> $class
> #> [1] "data.frame"
> #>
> #> $row.names
> #> [1] 1 2 3
>
> # let's make a data frame from scratch with `attributes<-`
> data <- list(x = 1:3)
> attributes(data) <- attrs
>
> # oh no!
> .row_names_info(data, type = 0L)
> #> [1] NA  3
>
> # Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
> # internal row names are not attempted to be created in the C level
> # `row_names_gets()`
>
> Thanks,
> Davis
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Corrupt internal row names when creating a data.frame with `attributes<-`

Bill Dunlap-2
as.matrix.data.frame does not take the absolute value of that number:
  > dPos <- structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,+3L))
  > dNeg <- structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,-3L))
  > rownames(as.matrix(dPos))
  [1] "1" "2" "3"
  > rownames(as.matrix(dNeg))
  NULL

-Bill

On Tue, Feb 16, 2021 at 11:06 AM Kevin Ushey <[hidden email]> wrote:

>
> Strictly speaking, I don't think this is a "corrupt" representation,
> given that any APIs used to access that internal representation will
> call abs() on the row count encoded within. At least, as far as I can
> tell, there aren't any adverse downstream effects from having the row
> names attribute encoded with this particular internal representation.
>
> On the other hand, the documentation in ?.row_names_info states, for
> the 'type' argument:
>
> integer. Currently type = 0 returns the internal "row.names" attribute
> (possibly NULL), type = 2 the number of rows implied by the attribute,
> and type = 1 the latter with a negative sign for ‘automatic’ row
> names.
>
> so one could argue that it's incorrect in light of that documentation
> (the row names are "automatic", but the row count is not marked with a
> negative sign). Or perhaps this is a different "type" of internal
> automatic row name, since it was generated from an already-existing
> integer sequence rather than "automatically" in a call to
> data.frame().
>
> Kevin
>
> On Sun, Feb 14, 2021 at 6:51 AM Davis Vaughan <[hidden email]> wrote:
> >
> > Hi all,
> >
> > I believe that the internal row names object created at this line in
> > `row_names_gets()` should be using `-n`, not `n`.
> > https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71
> >
> > This can currently generate corrupt internal row names when using
> > `attributes<-` or `structure()`, which calls `attributes<-`.
> >
> > # internal row names are typically `c(NA, -n)`
> > df <- data.frame(x = 1:3)
> > .row_names_info(df, type = 0L)
> > #> [1] NA -3
> >
> > # using `attributes()` materializes their non-internal form
> > attrs <- attributes(df)
> > attrs
> > #> $names
> > #> [1] "x"
> > #>
> > #> $class
> > #> [1] "data.frame"
> > #>
> > #> $row.names
> > #> [1] 1 2 3
> >
> > # let's make a data frame from scratch with `attributes<-`
> > data <- list(x = 1:3)
> > attributes(data) <- attrs
> >
> > # oh no!
> > .row_names_info(data, type = 0L)
> > #> [1] NA  3
> >
> > # Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
> > # internal row names are not attempted to be created in the C level
> > # `row_names_gets()`
> >
> > Thanks,
> > Davis
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Corrupt internal row names when creating a data.frame with `attributes<-`

Davis Vaughan
This originally came up in this dplyr issue:
https://github.com/tidyverse/dplyr/issues/5745

Where `tibble::column_to_rownames()` failed because it eventually checks
`.row_names_info(.data) > 0L` to see if there are automatic row names,
which is in line with the documentation that Kevin pointed out: "type = 1
the latter with a negative sign for ‘automatic’ row names."

Davis

On Tue, Feb 16, 2021 at 2:29 PM Bill Dunlap <[hidden email]>
wrote:

> as.matrix.data.frame does not take the absolute value of that number:
>   > dPos <-
> structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,+3L))
>   > dNeg <-
> structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,-3L))
>   > rownames(as.matrix(dPos))
>   [1] "1" "2" "3"
>   > rownames(as.matrix(dNeg))
>   NULL
>
> -Bill
>
> On Tue, Feb 16, 2021 at 11:06 AM Kevin Ushey <[hidden email]> wrote:
> >
> > Strictly speaking, I don't think this is a "corrupt" representation,
> > given that any APIs used to access that internal representation will
> > call abs() on the row count encoded within. At least, as far as I can
> > tell, there aren't any adverse downstream effects from having the row
> > names attribute encoded with this particular internal representation.
> >
> > On the other hand, the documentation in ?.row_names_info states, for
> > the 'type' argument:
> >
> > integer. Currently type = 0 returns the internal "row.names" attribute
> > (possibly NULL), type = 2 the number of rows implied by the attribute,
> > and type = 1 the latter with a negative sign for ‘automatic’ row
> > names.
> >
> > so one could argue that it's incorrect in light of that documentation
> > (the row names are "automatic", but the row count is not marked with a
> > negative sign). Or perhaps this is a different "type" of internal
> > automatic row name, since it was generated from an already-existing
> > integer sequence rather than "automatically" in a call to
> > data.frame().
> >
> > Kevin
> >
> > On Sun, Feb 14, 2021 at 6:51 AM Davis Vaughan <[hidden email]> wrote:
> > >
> > > Hi all,
> > >
> > > I believe that the internal row names object created at this line in
> > > `row_names_gets()` should be using `-n`, not `n`.
> > >
> https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71
> > >
> > > This can currently generate corrupt internal row names when using
> > > `attributes<-` or `structure()`, which calls `attributes<-`.
> > >
> > > # internal row names are typically `c(NA, -n)`
> > > df <- data.frame(x = 1:3)
> > > .row_names_info(df, type = 0L)
> > > #> [1] NA -3
> > >
> > > # using `attributes()` materializes their non-internal form
> > > attrs <- attributes(df)
> > > attrs
> > > #> $names
> > > #> [1] "x"
> > > #>
> > > #> $class
> > > #> [1] "data.frame"
> > > #>
> > > #> $row.names
> > > #> [1] 1 2 3
> > >
> > > # let's make a data frame from scratch with `attributes<-`
> > > data <- list(x = 1:3)
> > > attributes(data) <- attrs
> > >
> > > # oh no!
> > > .row_names_info(data, type = 0L)
> > > #> [1] NA  3
> > >
> > > # Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
> > > # internal row names are not attempted to be created in the C level
> > > # `row_names_gets()`
> > >
> > > Thanks,
> > > Davis
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > [hidden email] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Corrupt internal row names when creating a data.frame with `attributes<-`

Martin Maechler
>>>>> Davis Vaughan
>>>>>     on Tue, 16 Feb 2021 14:50:33 -0500 writes:

    > This originally came up in this dplyr issue:
    > https://github.com/tidyverse/dplyr/issues/5745

    > Where `tibble::column_to_rownames()` failed because it
    > eventually checks `.row_names_info(.data) > 0L` to see if
    > there are automatic row names, which is in line with the
    > documentation that Kevin pointed out: "type = 1 the latter
    > with a negative sign for ‘automatic’ row names."

    > Davis


    > On Tue, Feb 16, 2021 at 2:29 PM Bill Dunlap
    > <[hidden email]> wrote:

    >> as.matrix.data.frame does not take the absolute value of
    >> that number:

slightly changed and extended by MM {and as R script} :

------------------------------------------------------------------------

dPos <- structure(list(X=11:14, 1:4), class="data.frame", row.names=c(NA, +4L))
dNeg <- structure(list(X=11:14, 1:4), class="data.frame", row.names=c(NA, -4L))
##
all_rn_info <- function(x) lapply(setNames(,0:2),
                               function(tp) .row_names_info(x, type=tp))
str(all_rn_info(dPos))
## List of 3
##  $ 0: int [1:2] NA 4
##  $ 1: int 4
##  $ 2: int 4
str(all_rn_info(dNeg))
## List of 3
##  $ 0: int [1:2] NA -4
##  $ 1: int -4
##  $ 2: int 4
stopifnot(exprs = {
    identical(rownames(dPos), as.character(1:4))
    identical(rownames(dPos), rownames(dNeg))
    ## using as.matrix.data.frame() which differentiates, too :
    identical(rownames(as.matrix(dPos)), as.character(1:4))
    is.null  (rownames(as.matrix(dNeg)))
    ## BTW, also :
    identical(attributes(dPos), attributes(dNeg)) ## and hence also
    identical(dPos, dNeg) # another case where identical() is possibly too "tolerant"
})

## and for your interest, these *also* have both 'c(NA, +|n|)'  ==> give '+4'
.row_names_info(dInt1 <- structure(list(X=11:14, 1:4), class="data.frame", row.names=1:4))
.row_names_info(dInt2 <- local({ dd <- data.frame(X=11:14, 1:4, fix.empty.names = FALSE)
                                 attr(dd, "row.names") <- 1:4; dd }))
stopifnot(exprs = {
    identical(dInt1, dInt2)
    identical(all_rn_info(dInt1),
              all_rn_info(dInt2))
    identical(all_rn_info(dPos),
              all_rn_info(dInt1))
})

------------------------------------------------------------------------

There never was a conclusion here
(and the above is not the full context of the thread) ..
but if I understand Bill and his example (extended above) correctly,
he's indirectly hinting toward that there is **no bug** here :

1) You can use structure() well to get "truly automatic" row
   names by setting the row.names correctly to  c(NA, -3L)
   {yes, which is  c(NA_integer_, -3L) }

2) There's a subtle difference between *two* kinds of automatic
   row names, on purpose, notably used in  as.matrix.data.frame():
   c(NA, +3)  are automatic row names, too, but which translate also to
   matrix row names hence are somewhat slightly less automatic ...

   Note that you may see this documented by careful reading of
   the 'Note' in  help(row.names) *and* the 'Examples' section
   of that help page....

Last but not least:  We (R Core) did not document the nitty
gritty details here partly on purpose, because they should've
been subject to change, see e.g. the word "currently" in the
?row.names help page.

Notably with ALTREP objects, we could use "regular"  1:n
integer row names which would be ALTREP compacted automatically
for non-small 'n'.

Last but not least, the check in tibble that was mentioned in
this thread, should almost surely be fixed, if gives a problem
for these example, and I claim it has been *that* code that has
buggy rather than base R's one.

Martin

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel