Object and file sizes

Göran Broström-3
Hello,

I have two large data frames, 'liss' (170 million obs, 8 variables) and
'fobb' (52 million obs, 8 variables, same as for 'liss'), and checking
their sizes I get

 > object.size(liss)
7477492552 bytes
 > object.size(fobb)
2494591736 bytes

Fair enough, but when I save them to disk (saveRDS), the size relation
is reversed: 'fobb.rds' takes up 273 MB while 'liss.rds' uses 146 MB!

I was puzzled by this and thought that I had made a mistake in creating
them, but the only explanation I can find for this is that 'liss'
contains a lot more missing values.

Suggestions?

Thanks, Göran
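
[Editor's sketch, not part of the original post.] One way to check the missing-value hypothesis is to compute the share of NA values per column. The data frame below is a hypothetical toy stand-in, since 'liss' and 'fobb' are not available; the column-by-column loop is an assumption chosen to avoid building one large logical matrix on data of this size.

```r
# Hypothetical toy data frame standing in for 'liss' or 'fobb'.
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, NA, NA, 4),
                 c = c(1, 2, 3, 4))

# Fraction of missing values in each column, computed one column at a time.
na_share <- vapply(df, function(col) mean(is.na(col)), numeric(1))
print(na_share)
```

Running the same `vapply()` call on both data frames would show whether one of them really has a much higher NA share.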

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Object and file sizes

Duncan Murdoch-2
On 28/06/2019 7:35 a.m., Göran Broström wrote:

> Hello,
>
> I have two large data frames, 'liss' (170 million obs, 8 variables) and
> 'fobb' (52 million obs, 8 variables, same as for 'liss'), and checking
> their sizes I get
>
>   > object.size(liss)
> 7477492552 bytes
>   > object.size(fobb)
> 2494591736 bytes
>
> Fair enough, but when I save them to disk (saveRDS), the size relation
> is reversed: 'fobb.rds' takes up 273 MB while 'liss.rds' uses 146 MB!
>
> I was puzzled by this and thought that I had made a mistake in creating
> them, but the only explanation I can find for this is that 'liss'
> contains a lot more missing values.

saveRDS() uses compression by default.  Compression works best if there
are a lot of repetitive values; every NA is the same, so that would help
compression.  Other values may also be repeated.

If you use saveRDS(compress=FALSE), you'll get much larger results,
probably roughly proportional to the object.size() results.

Duncan Murdoch
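
[Editor's sketch, not part of the original reply.] The effect described above can be reproduced on a small scale. The vectors `big_sparse` and `small_dense` are hypothetical stand-ins for 'liss' and 'fobb': the first is larger in memory but almost entirely NA, the second is smaller but consists of random doubles that compress poorly. The sizes and the 1% non-missing share are arbitrary assumptions.

```r
set.seed(1)

# Larger object, about 99% NA (hypothetical stand-in for 'liss').
big_sparse <- rep(NA_real_, 2e6)
big_sparse[sample.int(2e6, 2e4)] <- rnorm(2e4)

# Smaller object, no repetition to exploit (hypothetical stand-in for 'fobb').
small_dense <- rnorm(5e5)

f_big   <- tempfile(fileext = ".rds")
f_small <- tempfile(fileext = ".rds")

# compress = TRUE (gzip) is the default for saveRDS().
saveRDS(big_sparse,  f_big)
saveRDS(small_dense, f_small)

sizes <- data.frame(
  object    = c("big_sparse", "small_dense"),
  in_memory = c(as.numeric(object.size(big_sparse)),
                as.numeric(object.size(small_dense))),
  on_disk   = c(file.size(f_big), file.size(f_small))
)
print(sizes)

# Clean up the temporary files.
unlink(c(f_big, f_small))
```

The `in_memory` column should rank the two objects one way and the `on_disk` column the other, which is the same reversal seen with 'liss.rds' and 'fobb.rds'.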

Re: Object and file sizes

Göran Broström-3


On 2019-06-28 15:26, Duncan Murdoch wrote:

> saveRDS() uses compression by default.  Compression works best if there
> are a lot of repetitive values; every NA is the same, so that would help
> compression.  Other values may also be repeated.
>
> If you use saveRDS(compress=FALSE), you'll get much larger results,
> probably roughly proportional to the object.size() results.

Almost equal to the object.size results: The differences are 2167 bytes
and 2171 bytes, respectively (smaller on disk). Thanks for the explanation!

Göran
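
[Editor's sketch, not part of the original message.] The near-equality of the uncompressed file size and object.size() can be checked on a toy example. The data frame `df` is a hypothetical stand-in with numeric and integer columns only; the exact difference depends on the object's attributes and the R version, so no specific number is claimed here.

```r
set.seed(1)

# Hypothetical toy data frame; the thread's 'liss' and 'fobb' are not available.
n  <- 1e5
df <- data.frame(id = sample.int(n),
                 x  = rnorm(n),
                 y  = rep(NA_real_, n))

f <- tempfile(fileext = ".rds")
saveRDS(df, f, compress = FALSE)   # write the serialized object uncompressed

mem  <- as.numeric(object.size(df))   # bytes in memory
disk <- file.size(f)                  # bytes on disk

print(c(in_memory = mem, on_disk = disk, difference = mem - disk))

# Clean up the temporary file.
unlink(f)
```

With `compress = FALSE` the file holds the raw vector data plus a small header, so the two figures should differ only by a small amount of per-object overhead.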
