Objectsize function visiting every element for alt-rep strings

Objectsize function visiting every element for alt-rep strings

Travers Ching
I have a toy alt-rep string package that generates randomly seeded strings.

example:
library(altstringisode)
x <- altrandomStrings(1e8)
head(x)
[1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
object.size(x)

object.size will call the set_altstring_Elt_method for every single
element, materializing (slowly) every element of the vector.  This is
a problem mostly in RStudio, since object.size is called
automatically there, defeating the purpose of alt-rep.

Is there a way to avoid the problem of forced materialization in RStudio?

PS: Is there a way to tell if a post has been received by the mailing
list?  How long does it take to show up in the archives?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Objectsize function visiting every element for alt-rep strings

Gabriel Becker-2
Travers,

Great to hear you're trying out the ALTREP stuff, good on you :).

Did you mean the get_altstring_Elt_method? I see the code in size.c within
utils that grabs each element, but I don't see any setting (and the setters
are no-ops currently anyway; they just do things the old way).

One thing we have to decide is what object.size means for an altrep. I tend
to think it should mean the size of the alternative representation
currently in use in memory, but I see that a small note in ?object.size
indicates that size of objects with compact internal representations may be
overestimated, so technically this is "as currently documented". The "we"
here, of course, is the R-core team so we'll have to see how they feel on
the matter.

As for what to do about it, one possibility is to add an object.size method
to the ALTREP method table that gets called if object.size is called on an
ALTREP object.  In this case, it would be up to the class to define an
appropriate object.size method. That would be relatively easy to do from a
technical standpoint on R's side, but what comes out of object.size would
be a bit "Wild West-y", without the consistency and correctness guarantees
one might expect from a function in utils.

Another option is to have object.size recurse, calling object.size on
the two parts (SEXPs which together make up a CONS cell, I believe) that
make up an ALTREP internally. Roughly speaking, one of these is usually the
alternative representation while the other is the spot to put an object
with the traditional representation if the payload is ever fully
materialized in an altrep-unsafe way - e.g., C code grabs a writable
dataptr via INTEGER, REAL, DATAPTR, etc. Note there are exceptions to what
I said above, though, such as the wrapper ALTREP classes which always have
the parent object (typically a traditionally laid-out vector), because the
"alternative representation" part is strictly a metadata annotation in that
case and contains no representation of the payload data for those classes.

In this second case the result of object.size would be consistent across
all ALTREP classes, but in both cases the result of object.size would no
longer give any information about the size of a vector *payload*. This is
consistent with how object.size deals with external pointers now, but could
lead to some surprise in the case of vectors which the end user may not
even know are ALTREPs.

Thoughts from anyone else on this list?

Anyway, thanks for pointing this out. I'll talk with Luke and see what
makes sense to do here.

Best,
~G

On Wed, Jan 16, 2019 at 3:49 AM Travers Ching <[hidden email]> wrote:

> [...]

Re: Objectsize function visiting every element for alt-rep strings

Travers Ching
Thanks for the detailed response, Gabriel!

I think that an object_size alt-rep method that package developers
need to implement might be hard to get right.  One alternative could
be an alt-rep method that returns the number of bytes/characters in a
given string element since I believe the object size of a CHARSXP
depends only on string length?  I think two optional alt-string
methods would be nice:

`alt_string_elt_nchars` -- for the `nchar` function in R
`alt_string_elt_nbytes` -- for `object.size` (which might be different
than nchars due to encoding)
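
To illustrate why the two proposed methods could return different values, here is a minimal sketch using base R's existing `nchar()`. The method names `alt_string_elt_nchars` and `alt_string_elt_nbytes` above are proposals only and are not part of the ALTREP API; the sample strings are arbitrary.

```r
# Three strings: plain ASCII, one accented Latin character, two CJK characters.
# The \u escapes produce strings stored in UTF-8.
s <- c("abc", "\u00e9", "\u65e5\u672c")

# Number of characters, as a user would count them.
n_chars <- nchar(s, type = "chars")

# Number of bytes used to store each string, which is what a
# size computation would care about.
n_bytes <- nchar(s, type = "bytes")

print(data.frame(n_chars = n_chars, n_bytes = n_bytes))
```

For pure ASCII strings the two counts agree, so a class that only ever produces ASCII could in principle answer both questions from the string length alone.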

Also since it's an issue that mainly affects R-studio, I started an
issue on their github, and it sounds like they'll avoid calling
object.size on alt-rep objects automatically.  That would fix the main
problem I've been having.

Thanks,
Travers

On Fri, Jan 18, 2019 at 2:49 PM Gabriel Becker <[hidden email]> wrote:

> [...]

Re: Objectsize function visiting every element for alt-rep strings

Martin Maechler
In reply to this post by Travers Ching
>>>>> Travers Ching
>>>>>     on Tue, 15 Jan 2019 12:50:45 -0800 writes:

    > I have a toy alt-rep string package that generates
    > randomly seeded strings.

    > example:
    > library(altstringisode)
    > x <- altrandomStrings(1e8)
    > head(x)
    > [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
    > object.size(x)

    > Object.size will call the set_altstring_Elt_method for
    > every single element, materializing (slowly) every element
    > of the vector.  This is a problem mostly in R-studio since
    > object.size is called automatically, defeating the purpose
    > of alt-rep.

Hmm.  But still, the idea had been that object.size()  *should*
return the size of the "de-ALTREP'ed" object *but* should not
de-ALTREP it.
That's what happens for integers, but indeed fails to happen for
such as.character(.)ed integers.

From my eRum presentation (which took from the official ALTREP documentation
https://svn.r-project.org/R/branches/ALTREP/ALTREP.html ) :

  > x <- 1:1e15
  > object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
  8000000000000048 bytes
  > is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
  [1] FALSE
  > xs <- sort(x)  #
  > .Internal(inspect(x))
  @80255f8 14 REALSXP g0c0 [NAM(7)]  1 : 1000000000000000 (compact)
  >

  > cx <- as.character(x)
  > .Internal(inspect(cx))
  @80485d8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
    @80255f8 14 REALSXP g1c0 [MARK,NAM(7)]  1 : 1000000000000000 (compact)
  > system.time( print(object.size(x)), gc=FALSE)
  8000000000000048 bytes
     user  system elapsed
    0.000   0.000   0.001
  > system.time( print(object.size(cx)), gc=FALSE)
  Error: cannot allocate vector of size 8388608.0 Gb
  Timing stopped at: 11.43 0 11.46
  >

One could consider it a bug that object.size(cx) is indeed
inspecting every string, i.e., accessing cx[i] for all i.
Note that it is *not*  deALTREPing cx  itself :

> x <- 1:1e6
> cx <- as.character(x)
> .Internal(inspect(cx))

@7f5b1a0 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
  @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
> system.time( print(object.size(cx)), gc=FALSE)
64000048 bytes
   user  system elapsed
  0.369   0.005   0.374
> .Internal(inspect(cx))
@7f5b1a0 16 STRSXP g0c0 [NAM(7)]   <deferred string conversion>
  @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
>
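
A smaller-scale version of the comparison above can be run as a script. This is only a sketch: it assumes R >= 3.5.0 (where `seq_len()` returns a compact sequence and `as.character()` on it returns a deferred conversion), and the length `n` and the variable names are arbitrary choices.

```r
n  <- 1e5L
x  <- seq_len(n)        # compact integer sequence (ALTREP) in R >= 3.5.0
cx <- as.character(x)   # deferred string conversion (ALTREP) in R >= 3.5.0

# An ordinary, fully allocated integer vector with the same values.
y <- x + 0L

# object.size() reports the size of the expanded vector for the compact
# sequence, so it agrees with the size of the ordinary copy.
size_compact <- as.numeric(object.size(x))
size_regular <- as.numeric(object.size(y))

# For the deferred strings, object.size() visits every element,
# which is where the time goes.
t_int <- system.time(object.size(x))[["elapsed"]]
t_chr <- system.time(size_chr <- as.numeric(object.size(cx)))[["elapsed"]]

cat("integer:  ", size_compact, "bytes,", t_int, "s\n")
cat("character:", size_chr, "bytes,", t_chr, "s\n")
```

Timings will vary by machine, so only the relative difference between the two calls is of interest.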

    > Is there a way to avoid the problem of forced
    > materialization in rstudio?

    > PS: Is there a way to tell if a post has been received by
    > the mailing list?  How long does it take to show up in the
    > archives?

[ that (waiting time) distribution is quite right-skewed... I'd
  guess its median to be less than 10 minutes... but we had
  artificially delayed it somewhat in the past to fight
  spammers, and ETH (the hosting institution) and others have
  increased spam and virus filtering, so everything has become
  quite a bit slower ]

Re: Objectsize function visiting every element for alt-rep strings

Tierney, Luke
On Mon, 21 Jan 2019, Martin Maechler wrote:

>>>>>> Travers Ching
>>>>>>     on Tue, 15 Jan 2019 12:50:45 -0800 writes:
>
>    > I have a toy alt-rep string package that generates
>    > randomly seeded strings.
>
>    > example:
>    > library(altstringisode)
>    > x <- altrandomStrings(1e8)
>    > head(x)
>    > [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
>    > object.size(x)
>
>    > Object.size will call the set_altstring_Elt_method for
>    > every single element, materializing (slowly) every element
>    > of the vector.  This is a problem mostly in R-studio since
>    > object.size is called automatically, defeating the purpose
>    > of alt-rep.

There is no sensible way in general to figure out how large the
strings would be without computing them. There might be one specifically
for a deferred sequence conversion, but figuring it out would require a
fair bit of effort that would be better spent elsewhere.

I've never been a big fan of object.size since what it is trying to
compute isn't very well defined in the context of sharing and possible
internal state changes (even before ALTREP, byte code compilation could
change the internals of a function [which object.size sees], and
assigning into environments or evaluating promises can change
environments [which object.size ignores]). The issue is not unlike the
one faced by identical(), which has a bunch of options for the
different ways objects can be identical, and might need even more.

We could in general have object.size for an ALTREP return the
object.size results of the current internal representation, but that
might not always be appropriate. Again, what object.size is trying to
compute isn't very well defined.

RStudio does seem to call object.size on every assignment to
.GlobalEnv. That might be worth revisiting.


Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: Objectsize function visiting every element for alt-rep strings

Kevin Ushey
I think that object.size() is most commonly used to answer the question,
"what R objects are consuming the most memory currently in my R session?"
and for that reason I think returning the size of the internal
representations of objects (e.g., for ALTREP objects or unevaluated
promises) is the right default behavior.

I also agree it would be worth considering adding arguments that control
how object.size() is computed for different kinds of R objects, since users
might want to use object.size() to answer different types of questions.

All that said, if the ultimate goal here is to avoid having RStudio
materialize ALTREP objects in the background, then perhaps that change
should happen in RStudio :-)

Best,
Kevin

On Tue, Jan 22, 2019 at 8:21 AM Tierney, Luke <[hidden email]>
wrote:

> [...]

Re: Objectsize function visiting every element for alt-rep strings

Tomas Kalibera
On 1/22/19 6:17 PM, Kevin Ushey wrote:
> I think that object.size() is most commonly used to answer the question,
> "what R objects are consuming the most memory currently in my R session?"
> and for that reason I think returning the size of the internal
> representations of objects (for e.g. ALTREP objects; unevaluated promises)
> is the right default behavior.

I don't think one could answer that question at all in the presence of
sharing (of objects with value semantics due to copy on write, string
cache or other caches, sharing of objects with referential semantics
such as environments, etc). Also the mapping from R objects (SEXPs) to
what users might understand as objects would not be clear (which SEXPs
belong to which "object", which SEXPs are too low-level for the user to
be considered, etc). In principle, there could be a memory profiler
working at SEXP level and exposing all the intricacies of the memory
layout, answering reachability questions on a heap dump (so one could
find out about a 1G integer vector and then list all bindings say in
namespace environments from which it is reachable), but of course that
would be a lot of work to implement and to maintain. The problem is not
unique to R (e.g. see Java with the same problems of sharing that
prevent a meaningful definition of object size). I am not persuaded it
makes sense to add more options to a function that does not have and
cannot have well-defined user-level semantics, and I would discourage
writing code that is trying to build on that function as I think that it
might lead to confusion and frustration. I think equality for example is
easier to define (just that one could come up with multiple meaningful
definitions, so it makes sense to have multiple options).
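
A small illustration of the sharing issue, as a sketch only (the variable names and vector length are arbitrary): the same numeric vector is placed twice in a list, so no second copy of the data is made, yet `object.size()` counts the vector once per list element.

```r
x <- runif(1e5)

# Both list elements refer to the same underlying vector;
# building the list does not copy the data.
pair <- list(x, x)

size_x    <- as.numeric(object.size(x))
size_pair <- as.numeric(object.size(pair))

cat("one vector:       ", size_x, "bytes\n")
cat("list of two refs: ", size_pair, "bytes\n")
```

The reported size of the list is roughly twice that of the single vector, which overstates the memory the list actually adds to the session.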

Best
Tomas

Re: Objectsize function visiting every element for alt-rep strings

Travers Ching
It should be possible to calculate object.size in the presence of
sharing, at least with respect to all sub-nodes of a SEXP.  E.g.,
during calculation, keep a hash of all SEXP pointers visited.  If a
pointer has already been visited, add only the size of the pointer to
the total object size.
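
A minimal sketch of that idea in C, assuming a toy Node type in place of R's real SEXPREC layout (Node, VisitedSet, seen_before, shared_aware_size and demo_shared_size are all hypothetical names made up for illustration, not anything in size.c). The visited set is a small fixed-size pointer hash table, which is fine for a demo but would have to grow in real code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an R node; NOT the real SEXPREC layout. */
typedef struct Node {
    size_t own_bytes;        /* bytes owned by this node itself */
    size_t n_children;       /* number of child pointers */
    struct Node **children;  /* may contain shared (repeated) pointers */
} Node;

#define TABLE_SLOTS 64       /* fixed capacity: demo only */

typedef struct {
    const Node *slots[TABLE_SLOTS];
    size_t used;
} VisitedSet;

/* Returns 1 if p is already in the set; otherwise inserts p and returns 0. */
static int seen_before(VisitedSet *set, const Node *p)
{
    size_t i = (size_t)(((uintptr_t)p >> 4) % TABLE_SLOTS);
    while (set->slots[i] != NULL) {
        if (set->slots[i] == p)
            return 1;
        i = (i + 1) % TABLE_SLOTS;        /* linear probing */
    }
    assert(set->used < TABLE_SLOTS - 1);  /* always keep one empty slot */
    set->slots[i] = p;
    set->used++;
    return 0;
}

/* Adds a node's own bytes, then each child: full size on the first visit,
   only the size of a pointer on every later visit. */
static size_t size_rec(const Node *node, VisitedSet *set)
{
    size_t total = node->own_bytes;
    for (size_t k = 0; k < node->n_children; k++) {
        const Node *child = node->children[k];
        if (child == NULL)
            continue;
        if (seen_before(set, child))
            total += sizeof(Node *);
        else
            total += size_rec(child, set);
    }
    return total;
}

size_t shared_aware_size(const Node *root)
{
    VisitedSet set = { { NULL }, 0 };
    seen_before(&set, root);  /* mark the root so cycles back to it stop */
    return size_rec(root, &set);
}

/* Example: a 64-byte list whose two elements are the same 1000-byte vector. */
size_t demo_shared_size(void)
{
    Node vec = { 1000, 0, NULL };
    Node *kids[2] = { &vec, &vec };
    Node list = { 64, 2, kids };
    return shared_aware_size(&list);
}
```

In the example the shared vector is counted in full once and as one pointer the second time, so the total is 64 + 1000 + sizeof(Node *) rather than 64 + 2000. Marking nodes as visited before descending also keeps the walk from looping on cyclic structures.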

Travers

On Wed, Jan 23, 2019 at 1:33 AM Tomas Kalibera <[hidden email]> wrote:

>
> On 1/22/19 6:17 PM, Kevin Ushey wrote:
> > I think that object.size() is most commonly used to answer the question,
> > "what R objects are consuming the most memory currently in my R session?"
> > and for that reason I think returning the size of the internal
> > representations of objects (for e.g. ALTREP objects; unevaluated promises)
> > is the right default behavior.
>
> I don't think one could answer that question at all in the presence of
> sharing (of objects with value semantics due to copy on write, string
> cache or other caches, sharing of objects with referential semantics
> such as environments, etc). Also the mapping from R objects (SEXPs) to
> what users might understand as objects would not be clear (which SEXPs
> belong to which "object", which SEXPs are too low-level for the user to
> be considered, etc). In principle, there could be a memory profiler
> working at SEXP level and exposing all the intricacies of the memory
> layout, answering reachability questions on a heap dump (so one could
> find out about a 1G integer vector and then list all bindings say in
> namespace environments from which it is reachable), but of course that
> would be a lot of work to implement and to maintain. The problem is not
> unique to R (e.g. see Java with the same problems of sharing that
> prevent meaningful definition for object size). I am not persuaded it
> makes sense to add more options to a function that does not have and
> cannot have a well defined user-level semantics, and I would discourage
> writing code that is trying to build on that function as I think that it
> might lead to confusion and frustration. I think equality for example is
> easier to define (just that one could come up with multiple meaningful
> definitions, so it makes sense to have multiple options).
>
> Best
> Tomas
> >
> > I also agree it would be worth considering adding arguments that control
> > how object.size() is computed for different kinds of R objects, since users
> > might want to use object.size() to answer different types of questions.
> >
> > All that said, if the ultimate goal here is to avoid having RStudio
> > materialize ALTREP objects in the background, then perhaps that change
> > should happen in RStudio :-)
> >
> > Best,
> > Kevin
> >
> > On Tue, Jan 22, 2019 at 8:21 AM Tierney, Luke <[hidden email]>
> > wrote:
> >
> >> On Mon, 21 Jan 2019, Martin Maechler wrote:
> >>
> >>>>>>>> Travers Ching
> >>>>>>>>      on Tue, 15 Jan 2019 12:50:45 -0800 writes:
> >>>     > I have a toy alt-rep string package that generates
> >>>     > randomly seeded strings.
> >>>     >
> >>>     > example:
> >>>     > library(altstringisode)
> >>>     > x <- altrandomStrings(1e8)
> >>>     > head(x)
> >>>     > [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
> >>>     > object.size(1e8)
> >>>
> >>>     > Object.size will call the set_altstring_Elt_method for
> >>>     > every single element, materializing (slowly) every element
> >>>     > of the vector.  This is a problem mostly in R-studio since
> >>>     > object.size is called automatically, defeating the purpose
> >>>     > of alt-rep.
> >> There is no sensible way in general to figure out how large the
> >> strings would be without computing them. There might be specifically
> >> for a deferred sequence conversion but it would require a fair bit of
> >> effort to figure out that would be better spent elsewhere.
> >>
> >> I've never been a big fan of object.size since what it is trying to
> >> compute isn't very well defined in the context of sharing and possible
> >> internal state changes (even before ALTREP, byte code compilation could
> >> change the internals of a function [which object.size sees] and
> >> assigning into environments or evaluating promises can change
> >> environments [which object.size ignores]). The issue is not unlike the
> >> one faced by identical(), which has a bunch of options for the
> >> different ways objects can be identical, and might need even more.
> >>
> >> We could in general have object.size for an ALTREP return the
> >> object.size results of the current internal representation, but that
> >> might not always be appropriate. Again, what object.size is trying to
> >> compute isn't very well defined.
> >>
> >> RStudio does seem to call object.size on every assignment to
> >> .GlobalEnv. That might be worth revisiting.
> >>
> >>
> >> Best,
> >>
> >> luke
> >>
> >>> Hmm.  But still, the idea had been that object.size() *should*
> >>> return the size of the "de-ALTREP'ed" object *but* should not
> >>> de-ALTREP it.
> >>> That's what happens for integers, but indeed fails to happen for
> >>> such as.character(.)ed integers.
> >>>
> >>> From my eRum presentation (which took from the official ALTREP documentation
> >>> https://svn.r-project.org/R/branches/ALTREP/ALTREP.html ) :
> >>>
> >>>   > x <- 1:1e15
> >>>   > object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
> >>>   8000000000000048 bytes
> >>>   > is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
> >>>   [1] FALSE
> >>>   > xs <- sort(x)  #
> >>>   > .Internal(inspect(x))
> >>>   @80255f8 14 REALSXP g0c0 [NAM(7)]  1 : 1000000000000000 (compact)
> >>>   >
> >>>
> >>>   > cx <- as.character(x)
> >>>   > .Internal(inspect(cx))
> >>>   @80485d8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
> >>>     @80255f8 14 REALSXP g1c0 [MARK,NAM(7)]  1 : 1000000000000000 (compact)
> >>>   > system.time( print(object.size(x)), gc=FALSE)
> >>>   8000000000000048 bytes
> >>>      user  system elapsed
> >>>     0.000   0.000   0.001
> >>>   > system.time( print(object.size(cx)), gc=FALSE)
> >>>   Error: cannot allocate vector of size 8388608.0 Gb
> >>>   Timing stopped at: 11.43 0 11.46
> >>>   >
> >>>
> >>> One could consider it a bug that object.size(cx) is indeed
> >>> inspecting every string, i.e., accessing cx[i] for all i.
> >>> Note that it is *not*  deALTREPing cx  itself :
> >>>
> >>>> x <- 1:1e6
> >>>> cx <- as.character(x)
> >>>> .Internal(inspect(cx))
> >>> @7f5b1a0 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
> >>>   @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
> >>>> system.time( print(object.size(cx)), gc=FALSE)
> >>> 64000048 bytes
> >>>    user  system elapsed
> >>>   0.369   0.005   0.374
> >>>> .Internal(inspect(cx))
> >>> @7f5b1a0 16 STRSXP g0c0 [NAM(7)]   <deferred string conversion>
> >>>   @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
> >>>     > Is there a way to avoid the problem of forced
> >>>     > materialization in rstudio?
> >>>
> >>>     > PS: Is there a way to tell if a post has been received by
> >>>     > the mailing list?  How long does it take to show up in the
> >>>     > archives?
> >>>
> >>> [ that (waiting time) distribution is quite right skewed... I'd
> >>>   guess its median to be less than 10 minutes... but we had
> >>>   artificially delayed it somewhat in the past to fight
> >>>   spammers, and ETH (the hosting institution) and others have
> >>>   increased spam and virus filtering so everything has become
> >>>   quite a bit slower ]
> >>>
> >> --
> >> Luke Tierney
> >> Ralph E. Wareham Professor of Mathematical Sciences
> >> University of Iowa                  Phone:             319-335-3386
> >> Department of Statistics and        Fax:               319-335-3017
> >>      Actuarial Science
> >> 241 Schaeffer Hall                  email:   [hidden email]
> >> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> >>