Reviving an old thread. I haven't seen this problem for a while when
saving RDS files, which is great. However, I noticed it again when saving
`qs` files (https://github.com/traversc/qs); `qs` is an RDS replacement
with a fast serialization / compression system.

I'd like to get an idea of what change was made within R to address this
issue for `saveRDS`, since that may help the author of the `qs` package
do something similar. I browsed the release notes for the last few years
(searching for "environment") and couldn't find it.

Many thanks for any help and best wishes to all.

The following code was run with R 3.6.2 and needs the qs and ggplot2
packages, so run install.packages(c("qs", "ggplot2")) first:

save_size_qs <- function(object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}

save_size_rds <- function(object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}

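normal_lm <- function() {
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

normal_ggplot <- function() {
  junk <- 1:1e+08
  ggplot2::ggplot()
}

clean_lm <- function() {
  junk <- 1:1e+08
  # Run the lm in its own environment, whose parent is the global
  # environment, so the formula does not capture `junk`
  env <- new.env(parent = globalenv())
  # Mirrors tfun1 in the quoted reply; with no `subset` argument here,
  # this binds whatever `subset` resolves to (normally base::subset)
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}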

# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163

save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895

# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255
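
To check the mechanism described in the quoted replies below (the formula drags along the environment it was created in) without involving any file format, here is a minimal sketch in base R. The helper names `make_fit` and `serialized_size` are made up for illustration, and `junk` is filled with random numbers rather than `1:1e+08` (my substitution) so that its size does not depend on how a given serializer stores integer sequences.

```r
# Hypothetical helpers for illustration; only base R and stats are used.
make_fit <- function(isolate = FALSE) {
  junk <- rnorm(1e5)  # large local object unrelated to the model
  if (isolate) {
    # Evaluate lm() in a fresh environment whose parent is the global
    # environment, so the formula does not capture this call frame.
    env <- new.env(parent = globalenv())
    with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
  } else {
    lm(Sepal.Length ~ Sepal.Width, data = iris)
  }
}

# Size in bytes of the uncompressed in-memory serialization of an object.
serialized_size <- function(object) {
  length(serialize(object, connection = NULL))
}

fit_plain <- make_fit()
fit_clean <- make_fit(isolate = TRUE)

# The terms object stores the environment in which the formula was created.
captured_plain <- attr(terms(fit_plain), ".Environment")
captured_clean <- attr(terms(fit_clean), ".Environment")

ls(captured_plain)  # names bound in the captured call frame
ls(captured_clean)  # names bound in the fresh environment

serialized_size(fit_plain)
serialized_size(fit_clean)
```

If the explanation in the thread holds, `junk` should appear in the first `ls()` result and not in the second, and the first size should exceed the second by roughly the size of `junk`.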

On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <[hidden email]> wrote:

> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel <[hidden email]> wrote:
>

>> Another solution is to only save the parts of the model object that
>> interest you. As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space. E.g.,
>>
>> tfun2 <- function(subset) {
>>   junk <- 1:1e6
>>   list(subset = subset,
>>        lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)$coef)
>> }
>>
>> saveSize(tfun2(1:4))
>> #[1] 152
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com

>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <[hidden email]> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what the
>> > call to lm() requires and to compute lm() in that environment. E.g.,
>> >
>> > tfun1 <- function (subset)
>> > {
>> >   junk <- 1:1e+06
>> >   env <- new.env(parent = globalenv())
>> >   env$subset <- subset
>> >   with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
>> > }
>> > Then we get
>> > > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> > [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive method.

>> >
>> > saveSize <- function (object) {
>> >   tf <- tempfile(fileext = ".RData")
>> >   on.exit(unlink(tf))
>> >   save(object, file = tf)
>> >   file.size(tf)
>> > }
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com

>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <[hidden email]> wrote:

>> >
>> >> In the code below, I generate a model in an environment that isn't
>> >> .GlobalEnv and that holds a large object unrelated to the model
>> >> generation. Saving the model seems to save the irrelevant object
>> >> unnecessarily. In my actual use case, I am running and saving many
>> >> models in a loop that each use a single large data.frame (that gets
>> >> collapsed into a small data.frame for estimation), so removing it
>> >> isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon

>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```

>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> ______________________________________________
>> >> [hidden email] mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
