Invisible names problem


Pan Domu
I ran into strange behavior when removing names.

Two ways of removing names:

    i <- rep(1:4, length.out=20000)
    k <- c(a=1, b=2, c=3, d=4)

    x1 <- unname(k[i])
    x2 <- k[i]
    x2 <- unname(x2)

Are they identical?

    identical(x1,x2) # TRUE

But not quite:

    identical(serialize(x1,NULL),serialize(x2,NULL)) # FALSE

But the problem is specific to serialization version 3, because:

    identical(serialize(x1,NULL,version = 2),serialize(x2,NULL,version = 2)) # TRUE

It seems that the second one keeps the names somewhere invisibly.

Some functions can lose them, e.g. head:

    identical(serialize(head(x1, 20001),NULL),serialize(head(x2, 20001),NULL)) # TRUE

But not saveRDS (so files are bigger). The tibble family keeps them, but base
data.frame seems to drop them.

From my tests, invisible names appear in the following cases:

    x1 <- k[i] %>% unname()
    x3 <- k[i]; x3 <- unname(x3)
    x5 <- k[i]; x5 <- `names<-`(x5, NULL)
    x6 <- k[i]; x6 <- unname(x6)

but not in these:

    x2 <- unname(k[i])
    x4 <- k[i]; names(x4) <- NULL

What kind of magic is that?

It hit us when we upgraded from 3.5 (when serialization changed) and had an
impact on parallelization (because serialized objects were bigger).
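
The size difference can be measured directly rather than inferred from identical(). The following is a minimal sketch, assuming R >= 3.5.0 (needed for version = 3); ser_size() is a hypothetical helper introduced here for illustration, and the numbers it reports may differ between R versions.

```r
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

x1 <- unname(k[i])            # names removed from an unbound temporary
x2 <- k[i]; x2 <- unname(x2)  # names removed from a bound variable

# ser_size() is a hypothetical helper: number of bytes serialize() produces
ser_size <- function(x, version) length(serialize(x, NULL, version = version))

sizes <- data.frame(
  object = c("x1", "x2"),
  v2 = c(ser_size(x1, 2), ser_size(x2, 2)),
  v3 = c(ser_size(x1, 3), ser_size(x2, 3))
)
print(sizes)
```

If the v3 column shows a larger value for x2 than for x1, the hidden names are being written out; the v2 column should show the same size for both.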


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Invisible names problem

Simon Urbanek
Very interesting:

> .Internal(inspect(k[i]))
@10a4bc000 14 REALSXP g0c7 [ATT] (len=20000, tl=0) 1,2,3,4,1,...
ATTRIB:
  @7fa24f07fa58 02 LISTSXP g0c0 [REF(1)]
    TAG: @7fa24b803e90 01 SYMSXP g0c0 [MARK,REF(5814),LCK,gp=0x6000] "names" (has value)
    @10a4e4000 16 STRSXP g0c7 [REF(1)] (len=20000, tl=0)
      @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(35005),gp=0x61] [ASCII] [cached] "a"
      @7fa24be24428 09 CHARSXP g0c1 [MARK,REF(35010),gp=0x61] [ASCII] [cached] "b"
      @7fa24b806ec0 09 CHARSXP g0c1 [MARK,REF(35082),gp=0x61] [ASCII] [cached] "c"
      @7fa24bcc6af0 09 CHARSXP g0c1 [MARK,REF(35003),gp=0x61] [ASCII] [cached] "d"
      @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(35005),gp=0x61] [ASCII] [cached] "a"
      ...

> .Internal(inspect(unname(k[i])))
@10a50c000 14 REALSXP g0c7 [] (len=20000, tl=0) 1,2,3,4,1,...

> .Internal(inspect(x2))
@7fa24fc692d8 14 REALSXP g0c0 [REF(1)]  wrapper [srt=-2147483648,no_na=0]
  @10a228000 14 REALSXP g0c7 [REF(1),ATT] (len=20000, tl=0) 1,2,3,4,1,...
  ATTRIB:
    @7fa24fc69850 02 LISTSXP g0c0 [REF(1)]
      TAG: @7fa24b803e90 01 SYMSXP g0c0 [MARK,REF(5797),LCK,gp=0x4000] "names" (has value)
      @10a250000 16 STRSXP g0c7 [REF(65535)] (len=20000, tl=0)
        @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(10005),gp=0x61] [ASCII] [cached] "a"
        @7fa24be24428 09 CHARSXP g0c1 [MARK,REF(10010),gp=0x61] [ASCII] [cached] "b"
        @7fa24b806ec0 09 CHARSXP g0c1 [MARK,REF(10077),gp=0x61] [ASCII] [cached] "c"
        @7fa24bcc6af0 09 CHARSXP g0c1 [MARK,REF(10003),gp=0x61] [ASCII] [cached] "d"
        @7fa24ba575c8 09 CHARSXP g0c1 [MARK,REF(10005),gp=0x61] [ASCII] [cached] "a"
        ...

If you don't assign the intermediate result, things are simple: R knows there are no other references, so the names can simply be removed. However, if you assign the result, that is not possible, as there is still the reference in x2 at the time when unname() creates its own local temporary variable obj to do what most of us would probably use, which is names(obj) <- NULL (i.e. names(x2) <- NULL avoids that problem, since you don't need both x2 and obj).

To be precise, when you use unname() on an assigned object, R technically has to keep two copies: one for the existing x2 and a second in unname() for obj, so it can call names(obj) <- NULL for the modification. To avoid that, R instead creates a wrapper for the original x2 which says "like x2 but names are NULL". The rationale is that for large vectors it is better to keep records of metadata changes rather than duplicating the object. This way the vector is stored only once. However, as you blow away the original x2, all that is left is k[i] with the extra information "don't use the names". Unfortunately, R cannot know that you will eventually only keep the version without the names, at which point it could strip the names since they are not referenced anymore.

I'm not sure what is the best solution here. In theory, if the wrapper found out that the object it is wrapping has no more references it could remove the names, but I'm sure that would only solve some cases (what if you duplicated the wrapper and thus there were multiple wrappers referencing it?) and not sure if it has a way to find out. The other way to deal with that would be at serialization time if it could be detected such that it can remove the wrapper. Since the intersection of serialization experts and ALTREP experts is exactly one, I'll leave it to that set to comment further ;).
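
The contrast between the two forms can be sketched as follows. This assumes R >= 3.5.0; the variable names x_fun and x_rep are illustrative only, and what inspect prints may vary between R versions.

```r
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

# Function form: x_fun is still bound while unname() works on its local obj,
# so R may return a wrapper around the original named vector.
x_fun <- k[i]
x_fun <- unname(x_fun)

# Replacement form: the modification targets the variable itself,
# so no second reference is needed.
x_rep <- k[i]
names(x_rep) <- NULL

identical(x_fun, x_rep)

# Compare the internal representations (same tool as used above).
.Internal(inspect(x_fun))
.Internal(inspect(x_rep))
```

At the R level both results are the same vector without names; any difference shows up only in the inspect output and in the serialized size.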

Cheers,
Simon





Re: Invisible names problem

Duncan Murdoch-2
In reply to this post by Pan Domu
On 22/07/2020 3:29 p.m., Pan Domu wrote:

> I ran into strange behavior when removing names.
>
> [...]

You can use .Internal(inspect(x1)) and .Internal(inspect(x2)) to see
that the two objects are not identical:

 > .Internal(inspect(x1))
@1116b7000 14 REALSXP g0c7 [REF(2)] (len=20000, tl=0) 1,2,3,4,1,...
 > .Internal(inspect(x2))
@7f9c77664ce8 14 REALSXP g0c0 [REF(2)]  wrapper [srt=-2147483648,no_na=0]
   @10e7b7000 14 REALSXP g0c7 [REF(6),ATT] (len=20000, tl=0) 1,2,3,4,1,...
   ATTRIB:
     @7f9c77664738 02 LISTSXP g0c0 [REF(1)]
       TAG: @7f9c6c027890 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x4000] "names" (has value)
       @10e3ac000 16 STRSXP g0c7 [REF(65535)] (len=20000, tl=0)
         @7f9c6ab531c8 09 CHARSXP g1c1 [MARK,REF(10066),gp=0x61] [ASCII] [cached] "a"
         @7f9c6ae9a678 09 CHARSXP g1c1 [MARK,REF(10013),gp=0x61] [ASCII] [cached] "b"
         @7f9c6c0496c0 09 CHARSXP g1c1 [MARK,REF(10568),gp=0x61,ATT] [ASCII] [cached] "c"
         @7f9c6ad3df40 09 CHARSXP g1c1 [MARK,REF(10029),gp=0x61,ATT] [ASCII] [cached] "d"
         @7f9c6ab531c8 09 CHARSXP g1c1 [MARK,REF(10066),gp=0x61] [ASCII] [cached] "a"
         ...


It looks as though x2 is a tiny ALTREP object acting as a wrapper on the
original k[i], but I might be misinterpreting those displays.  I don't
know how to force ALTREP objects to standard representation:
unserializing the serialized x2 gives something like x2, not like x1.
Maybe you want to look at one of the contributed low level packages.
The stringfish package has a "materialize" function that is advertised
to convert anything to standard format, but it doesn't change x2.
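
One possible way to get a standard vector, suggested by the head() observation in the original post, is to subset every element, which allocates a new vector. The sketch below is only an assumption-laden illustration: plain_copy() is a hypothetical helper (not part of base R or stringfish), it costs a full copy of the data, and whether the result is smaller when serialized depends on the R version.

```r
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

x2 <- k[i]
x2 <- unname(x2)

# plain_copy() is a hypothetical helper: subsetting all elements
# builds a fresh vector instead of reusing the wrapped one.
plain_copy <- function(x) x[seq_along(x)]

y <- plain_copy(x2)

identical(y, x2)            # same values at the R level
length(serialize(x2, NULL)) # bytes for the original object
length(serialize(y, NULL))  # bytes for the copy
```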

Duncan Murdoch


Re: [External] Re: Invisible names problem

luke-tierney
In reply to this post by Simon Urbanek
On Wed, 22 Jul 2020, Simon Urbanek wrote:

> [...]
>
> I'm not sure what is the best solution here. In theory, if the wrapper found out that the object it is wrapping has no more references it could remove the names, but I'm sure that would only solve some cases (what if you duplicated the wrapper and thus there were multiple wrappers referencing it?) and not sure if it has a way to find out. The other way to deal with that would be at serialization time if it could be detected such that it can remove the wrapper. Since the intersection of serialization experts and ALTREP experts is exactly one, I'll leave it to that set to comment further ;).

Currently the wrapper serialization mechanism just serializes the
wrapped object and unserialize re-wraps it at the other end.
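
A round trip illustrates this. The sketch below assumes R >= 3.5.0 and the default serialization version of the running R; the byte counts it prints are not guaranteed to match across R versions.

```r
i <- rep(1:4, length.out = 20000)
k <- c(a = 1, b = 2, c = 3, d = 4)

x2 <- k[i]
x2 <- unname(x2)

bytes <- serialize(x2, NULL)  # wrapper state written to a raw vector
y <- unserialize(bytes)       # rebuilt at the other end

identical(y, x2)              # values agree at the R level
length(bytes)                 # size of the first serialization
length(serialize(y, NULL))    # size after the round trip
```

If the two sizes are the same, the hidden names survived the round trip along with the wrapper.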

If there is only one reference to the wrapped value then we know the
attributes can't be accessed from the R level anymore, so it would be
safe to remove the attributes before passing it off for serializing.
Unless I'm missing something that would be an easy change. But it
would be good to know if it would really make a difference in
realistic situations.

[Dropping attributes could be done at other times as well if there is
only one reference, e.g. on accessing the data, but that is not likely
to be worthwhile within a single R session.]

If there is more than one reference to the wrapped object, then things
are more complicated. We could duplicate the payload and send that off
for serialization (and install it in the wrapper), but that could be a
bad idea if the object is large.

A tighter integration of ALTREP serialization with the serialization
internals might allow an ALTREP's serialization method to write
directly to the serialization stream, but that would make things much
harder to maintain.

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
